This is a brief guide to STATA for students in sociology 180
(updated
Remember to start a log file before you do any important work; otherwise your work won't be saved.
Remember that STATA has online help. Use it to get more advice about the commands you need, their options, etc.
Remember that you may need to bump up the memory allocated to STATA before you can use the dataset. From the STATA command line, type:
. set mem 200m
(this bumps the memory up to 200 MB, which gives us reasonable headroom considering that the CPS dataset is ~90 MB. If you are going to use the 2000 CPS alone, that dataset is about 10MB in size and you would want to set memory to at least 20MB)
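A typical start to a session might look like the sketch below. The log file name is made up for illustration, and the set mem line only matters for older versions of STATA (version 12 and later manage memory automatically and ignore the setting).

```stata
* Open a log so that every command and its output is saved to a file.
* The file name is hypothetical; use any name you like.
log using "soc180_session.log", text replace

* Look up the documentation for a command.
help tabulate

* Older versions of STATA need the memory set by hand before a large
* dataset is loaded.
set mem 200m

* ... load the data and do your work here ...

* Close the log when you are finished.
log close
```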
Useful STATA commands:
1) describe. This tells you about your dataset. For instance:
. describe
Contains data from E:\AAA Miker Data folder\March CPS files for class\version 2 with occ90\Multiyear CPS.dta
  obs:       896,445
 vars:            51
 size:    92,333,835 (64.8% of memory free)
-------------------------------------------------------------------------------------------------------------------
                storage  display     value
variable name   type     format      label         variable label
-------------------------------------------------------------------------------------------------------------------
year            int      %8.0g       yearlbl       Survey year
serial          long     %12.0g      seriallbl     Household serial number
hhwt            float    %9.0g       hhwtlbl       Household weight
region          byte     %27.0g      regionlbl     Region and division
statefip        byte     %57.0g      statefiplbl   State (FIPS code)
metro           byte     %27.0g      metrolbl      Metropolitan central city status
metarea         int      %50.0g      metarealbl    Metropolitan area
ownershp        byte     %21.0g      ownershplbl   Ownership of dwelling
hhincome        long     %12.0g      hhincomelbl   Total household income
pubhous         byte     %8.0g       pubhouslbl    Living in public housing
foodstmp        byte     %8.0g       foodstmplbl   Food stamp recipiency
pernum          byte     %8.0g       pernumlbl     Person number in sample unit
perwt           float    %9.0g       perwtlbl      Person weight
momloc          byte     %8.0g       momloclbl     Mother's location in the household
poploc          byte     %8.0g       poploclbl     Father's location in the household
sploc           byte     %8.0g       sploclbl      Spouse's location in household
famsize         byte     %25.0g      famsizelbl    Number of own family members in hh
nchild          byte     %18.0g      nchildlbl     Number of own children in household
nchlt5          byte     %23.0g      nchlt5lbl     Number of own children under age 5 in hh
nsibs           byte     %18.0g      nsibslbl      Number of own siblings in household
relate          int      %34.0g      relatelbl     Relationship to household head
age             byte     %19.0g      agelbl        Age
sex             byte     %8.0g       sexlbl        Sex
race            int      %37.0g      racelbl       Race
marst           byte     %23.0g      marstlbl      Marital status
popstat         byte     %14.0g      popstatlbl    Adult civilian, armed forces, or child
bpl             long     %27.0g      bpllbl        Birthplace
yrimmig         int      %11.0g      yrimmiglbl    Year of immigration
citizen         byte     %31.0g      citizenlbl    Citizenship status
mbpl            long     %27.0g      mbpllbl       Mother's birthplace
fbpl            long     %27.0g      fbpllbl       Father's birthplace
hispan          int      %29.0g      hispanlbl     Hispanic origin
educ99          byte     %38.0g      educ99lbl     Educational attainment, 1990
empstat         byte     %30.0g      empstatlbl    Employment status
occ1990         int      %78.0g      occ1990lbl    Occupation, 1990 basis
wkswork1        byte     %8.0g       wkswork1lbl   Weeks worked last year
hrswork         byte     %8.0g       hrsworklbl    Hours worked last week
uhrswork        byte     %13.0g      uhrsworklbl   Usual hours worked per week (last yr)
hourwage        int      %8.0g       hourwagelbl   Hourly wage
union           byte     %33.0g      unionlbl      Union membership
inctot          long     %12.0g      inctotlbl     Total personal income
incwage         long     %12.0g      incwagelbl    Wage and salary income
incss           long     %12.0g      incsslbl      Social Security income
incwelfr        long     %12.0g      incwelfrlbl   Welfare (public assistance) income
vetstat         byte     %10.0g      vetstatlbl    Veteran status
vetlast         byte     %26.0g      vetlastlbl    Veteran's most recent period of service
disabwrk        byte     %34.0g      disabwrklbl   Work disability
health          byte     %9.0g       healthlbl     Health status
inclugh         byte     %8.0g       inclughlbl    Included in employer group health plan last year
himcaid         byte     %8.0g       himcaidlbl    Covered by Medicaid last year
ftotval         double   %10.0g      ftotvallbl    Total family income
-------------------------------------------------------------------------------------------------------------------
Sorted by:
Things that this tells us: There are 896,445 ‘observations’ in this dataset. Each observation is one person, and this represents the full number of individuals in the March Current Population Survey for the survey years included (see the tabulation of year below). There are 51 ‘variables’ in the dataset, which is a small fraction of the number of variables available in the CPS. I will be adding a few variables to the dataset over time, so this number will grow. The size of the dataset is 92.3 million bytes, or something like 90 MB. Since STATA wants to have all the data in memory, you need to allocate more memory than that to STATA before you can load and work on the dataset (the 200 MB set above leaves reasonable headroom).
Some more info: the variables are listed in the left hand column; this is the list of variables you will see in your ‘variable’ window in STATA. The next column, storage type, tells you how STATA stores the variable. Byte, int, long and float are all numeric types: byte, int and long hold whole numbers of increasing size, and float can also hold numbers with decimals. You’ll notice, for instance, that the variable ‘race’ is stored as ‘int’, which means the values of the variable ‘race’ are integers. But why store race as 1, 2, 3 etc. instead of ‘Black’, ‘White’, etc.? Well, 1, 2, 3 takes up less space. So how do you know which number corresponds to which race? The best way is to attach labels to the values, which IPUMS has done for us. You can see which variables have value labels and which don’t in the description of the dataset above.
Every variable has a ‘variable description’. The variables and their descriptions are best located at the website www.ipums.org, where the data come from. Even more specifically, see
* ipums variable descriptions for CPS here: http://cps.ipums.org/cps-action/variableGroups.do
* and ipums introduction to the CPS methodology here: http://cps.ipums.org/cps/documentation.shtml
It’s important to look at the Data Dictionary because sometimes a value like -99 really means -99, and sometimes it means ‘missing value’. On variables like income there will be a ‘topcode’, which is the highest income that the Census Bureau will report in order to preserve confidentiality. That’s important stuff to know.
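One way to check how a variable is coded from inside STATA is sketched below. It assumes the value label names shown in the describe output above (racelbl, for instance); the authoritative definitions are still the ones in the IPUMS documentation.

```stata
* List every numeric code and its label for the race variable
* (racelbl is the value label name shown by describe).
label list racelbl

* Tabulate the raw numeric codes instead of the labels.
tabulate hispan if year==2000, nolabel

* For a continuous variable, the detail option adds percentiles and the
* largest values, which can help you spot a topcode or an odd code
* such as -99.
summarize incwage if year==2000, detail
```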
2) Tabulate. Tabulate gives you the breakdown on categorical data, such as:
. tabulate year
Survey year |      Freq.     Percent        Cum.
------------+-----------------------------------
       1962 |     71,741        8.00        8.00
       1970 |    145,023       16.18       24.18
       1980 |    181,488       20.25       44.43
       1990 |    158,079       17.63       62.06
       2000 |    133,710       14.92       76.98
       2008 |    206,404       23.02      100.00
------------+-----------------------------------
      Total |    896,445      100.00
Here you see that 896,445 individual observations are spread across 6 different survey years. If you want to limit yourself to 1 survey year, you have to specify that one year, otherwise you get all 6 mixed together. The CPS survey goes out into the field every month, so there is March CPS data for every year in this span, but I have provided a subset of the years (including earliest and most recent) to keep the size of the dataset manageable.
. tabulate race year
                      |                           Survey year
                 Race |      1962       1970       1980       1990       2000       2008 |     Total
----------------------+------------------------------------------------------------------+----------
                White |    64,266    127,659    158,274    135,652    113,475    164,142 |   763,468
          Black/Negro |     6,849     16,038     17,711     16,036     13,626     23,864 |    94,124
American Indian/Aleut |         0          0          0      1,471      1,894      2,803 |     6,168
Asian or Pacific Isla |         0          0          0      4,362      4,715          0 |     9,077
           Asian only |         0          0          0          0          0      9,617 |     9,617
Hawaiian/Pacific Isla |         0          0          0          0          0        893 |       893
 Other (single) race, |       626      1,326      5,503        558          0          0 |     8,013
          White-Black |         0          0          0          0          0      1,003 |     1,003
White-American Indian |         0          0          0          0          0      1,907 |     1,907
          White-Asian |         0          0          0          0          0        893 |       893
White-Hawaiian/Pacifi |         0          0          0          0          0        253 |       253
Black-American Indian |         0          0          0          0          0        202 |       202
          Black-Asian |         0          0          0          0          0         57 |        57
Black-Hawaiian/Pacifi |         0          0          0          0          0          8 |         8
American Indian-Asian |         0          0          0          0          0         18 |        18
Asian-Hawaiian/Pacifi |         0          0          0          0          0        233 |       233
 White-Black-American |         0          0          0          0          0        173 |       173
    White-Black-Asian |         0          0          0          0          0          8 |         8
White-American Indian |         0          0          0          0          0         22 |        22
White-Asian-Hawaiian/ |         0          0          0          0          0        240 |       240
 White-Black-American |         0          0          0          0          0          4 |         4
Two or three races, u |         0          0          0          0          0         35 |        35
Four or five races, u |         0          0          0          0          0         29 |        29
----------------------+------------------------------------------------------------------+----------
                Total |    71,741    145,023    181,488    158,079    133,710    206,404 |   896,445
This is a cross-tabulation, a tabulation of two variables. The first thing you will notice is that the different survey years coded race differently. That is typical: surveys change and adapt. The ipums documentation will generally show you how the variables have changed over time. For instance, the 1962, 1970, and 1980 CPS did not have a separate category for Asians. For 2008, the CPS adopted the newer census rules which let people choose more than one racial category, so they had to categorize a whole bunch of new multiracial combinations, all of which together constitute a fairly small percentage of the population. Out of 133,710 persons in the March 2000 survey, 113,475 are White, and 13,626 are Black. What about the Hispanics? Well, because of the funny way that the Census Bureau categorizes things, ‘Hispanic’ is not a ‘race’, so that the ‘Hispanics’ are hidden in this table, mostly under the ‘White’ category. There’s a separate question about Hispanic ancestry, called hispan.
The CPS is a nationally representative survey, which means you are supposed to be able to say things about the US non-institutional population as a whole, not just the 133,710 people in the survey. How do you generalize to the whole US population? You use the weights provided by the CPS. See section 3 below, on weights.
Here is a cross tabulation of Hispanicity by race, using only data from 2000.
. tabulate hispan race if year==2000
                      |                    Race
      Hispanic origin |     White  Black/Neg   American   Asian or |     Total
----------------------+--------------------------------------------+----------
         Not Hispanic |    89,551     12,885      1,646      4,559 |   108,641
     Mexican American |     6,337         29         73          8 |     6,447
      Chicano/Chicana |       360          0         17          7 |       384
   Mexican (Mexicano) |     7,970         55        109         21 |     8,155
         Puerto Rican |     2,057        169         19         35 |     2,280
                Cuban |       905         34          0          4 |       943
        Other Spanish |     1,652        171         15         25 |     1,863
        Central/South |     3,206        238         12         31 |     3,487
          Do not know |       461          2          0          8 |       471
 N/A (and no response |       976         43          3         17 |     1,039
----------------------+--------------------------------------------+----------
                Total |   113,475     13,626      1,894      4,715 |   133,710
It's sometimes helpful in a cross tabulation to get the row and column percentages, so:
. tabulate hispan race if year==2000, row col
+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+
                      |                    Race
      Hispanic origin |     White  Black/Neg   American   Asian or |     Total
----------------------+--------------------------------------------+----------
         Not Hispanic |    89,551     12,885      1,646      4,559 |   108,641
                      |     82.43      11.86       1.52       4.20 |    100.00
                      |     78.92      94.56      86.91      96.69 |     81.25
----------------------+--------------------------------------------+----------
     Mexican American |     6,337         29         73          8 |     6,447
                      |     98.29       0.45       1.13       0.12 |    100.00
                      |      5.58       0.21       3.85       0.17 |      4.82
----------------------+--------------------------------------------+----------
      Chicano/Chicana |       360          0         17          7 |       384
                      |     93.75       0.00       4.43       1.82 |    100.00
                      |      0.32       0.00       0.90       0.15 |      0.29
----------------------+--------------------------------------------+----------
   Mexican (Mexicano) |     7,970         55        109         21 |     8,155
                      |     97.73       0.67       1.34       0.26 |    100.00
                      |      7.02       0.40       5.76       0.45 |      6.10
----------------------+--------------------------------------------+----------
         Puerto Rican |     2,057        169         19         35 |     2,280
                      |     90.22       7.41       0.83       1.54 |    100.00
                      |      1.81       1.24       1.00       0.74 |      1.71
----------------------+--------------------------------------------+----------
                Cuban |       905         34          0          4 |       943
                      |     95.97       3.61       0.00       0.42 |    100.00
                      |      0.80       0.25       0.00       0.08 |      0.71
----------------------+--------------------------------------------+----------
        Other Spanish |     1,652        171         15         25 |     1,863
                      |     88.67       9.18       0.81       1.34 |    100.00
                      |      1.46       1.25       0.79       0.53 |      1.39
----------------------+--------------------------------------------+----------
        Central/South |     3,206        238         12         31 |     3,487
                      |     91.94       6.83       0.34       0.89 |    100.00
                      |      2.83       1.75       0.63       0.66 |      2.61
----------------------+--------------------------------------------+----------
          Do not know |       461          2          0          8 |       471
                      |     97.88       0.42       0.00       1.70 |    100.00
                      |      0.41       0.01       0.00       0.17 |      0.35
----------------------+--------------------------------------------+----------
 N/A (and no response |       976         43          3         17 |     1,039
                      |     93.94       4.14       0.29       1.64 |    100.00
                      |      0.86       0.32       0.16       0.36 |      0.78
----------------------+--------------------------------------------+----------
                Total |   113,475     13,626      1,894      4,715 |   133,710
                      |     84.87      10.19       1.42       3.53 |    100.00
                      |    100.00     100.00     100.00     100.00 |    100.00
What do we learn from this? 78.92% of whites are not Hispanic (that is the column percentage). About 98% of Mexican Americans and Mexicanos, and 90% of Puerto Ricans, classify themselves as white (those are the row percentages).
3) The use of weights: Most commands in STATA can handle weights. Compare the two commands below: the first counts the number of records in the dataset (as above), and the second uses weights to give the actual population of the US. (The weight variable perwt_rounded is created in section 7 below.)
. tabulate race if year==2000
                                 Race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                White |    113,475       84.87       84.87
                          Black/Negro |     13,626       10.19       95.06
         American Indian/Aleut/Eskimo |      1,894        1.42       96.47
            Asian or Pacific Islander |      4,715        3.53      100.00
--------------------------------------+-----------------------------------
                                Total |    133,710      100.00
. tabulate race if year==2000 [fweight= perwt_rounded]
                                 Race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                White |224,806,952       82.02       82.02
                          Black/Negro | 35,508,668       12.96       94.98
         American Indian/Aleut/Eskimo |  2,847,473        1.04       96.01
            Asian or Pacific Islander | 10,924,728        3.99      100.00
--------------------------------------+-----------------------------------
                                Total |274,087,821      100.00
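A little arithmetic on the two tables above can help make the weights concrete. This is only a sketch using the totals printed above; display works as a calculator in STATA.

```stata
* Average number of people represented by each record in the 2000 sample
* (weighted total divided by number of records): roughly 2,050.
display 274087821/133710

* Black share of the sample (unweighted) and of the population (weighted).
display 100*13626/133710
display 100*35508668/274087821
```

Note that fweight only accepts whole-number weights, which is why the rounded variable perwt_rounded is used here rather than perwt itself.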
What does this mean?
The total
One more point.
You’ll notice that in second panel above, the weighted data, that Blacks
represent 12.96% of the
4) Summarize. For numerical variables where the numbers really mean something, such as earnings, the summarize command provides mean (i.e. numerical average), standard deviation, minimum and maximum values.
. summarize incwage if year==2000
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |    103226    19462.59    28843.38          0     364302
What does this mean? Well, it means the average earned income for 1999 (the full year before the survey) was $19,463. This seems kind of low, doesn’t it? One of the most important things you have to do is interpret the output, and ask yourself whether the output makes sense. The number of observations here is 103,226, which is less than 133,710, the full number of individuals in the March 2000 CPS. But the survey includes lots of people who are too young or too old to work, not to speak of the adults who are unemployed, and so on! Perhaps the mean earnings appear to be low because we’ve included in our sample population some people whose earnings must be zero. We will use the if qualifier to tell STATA that we only want to examine a particular sub-population (see below for further explanation). Also, when we get around to comparing wages and income over time, it is important to take inflation into account (otherwise the numbers won’t really make sense).
5) Sorting. You can sort by variables (sex, race, veteran status etc) and then calculate earnings (the Sort and then By syntax works with most STATA commands)
. sort sex
. by sex: summarize incwage if year==2000
-> sex = Male
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |     49353     25943.8    34862.55          0     364302
-> sex = Female
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |     53873    13525.17    20172.65          0     333564
The average earnings of men in the sample are $25,944 per year, compared to $13,525 for women, but remember we’re still including all sorts of people here whose earnings must be zero.
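As a shortcut, the sort step and the by step can usually be combined with bysort. A minimal sketch, assuming the same variables as above:

```stata
* bysort sorts the data by sex and then runs the command once per group,
* so a separate sort command is not needed first.
bysort sex: summarize incwage if year==2000
```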
6) Let's say you want to look at earnings, but only for people in their 20s. You would use the IF construction, as follows:
. by sex: summarize incwage if year==2000 & age>19 & age<30
-> sex = Male
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |      8351    19628.57     19126.9          0     257525
-> sex = Female
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |      8808     13355.3    16022.36          0     333564
And you could further limit it by looking only at people who had positive earnings, so
. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0
-> sex = Male
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |      7348    22307.87    18868.08          8     257525
-> sex = Female
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |      6921     16996.6    16273.32          5     333564
But notice here that the low end of the income spectrum for both men and women still includes people with yearly(!) wages of less than $10 for 1999. Those folks will skew the comparison quite a bit.
If you use the weight variable, you'll get a more accurate
average, and you'll know how many people in the
. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 [fweight= perwt_rounded]
-------------------------------------------------------------------------------------------------------------------
-> sex = Male
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |  15802234     22712.7    19420.26          8     257525
-------------------------------------------------------------------------------------------------------------------
-> sex = Female
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |  14773668    17329.45    16514.31          5     333564
What this means is that there are 15.8 million
twentysomething men and 14.8 million twentysomething women with positive
incomes in the
. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 & wkswork1>29 [fweight= perwt_rounded]
-------------------------------------------------------------------------------------------------------------------
-> sex = Male
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |  13486090    25412.34    19336.23          8     257525
-------------------------------------------------------------------------------------------------------------------
-> sex = Female
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     incwage |  11744135    20436.34    16906.75          5     333564
Note that in the above command, I said “year==2000” with a double equal sign. Why? Well, STATA has some special syntax for stuff that goes after the IF.
What is the syntax to use after the IF?
<     less than
>     greater than
<=    less than or equal to
>=    greater than or equal to
==    equals (that's right, in comparisons STATA uses the double equal sign to mean 'equal')
~=    not equal (!= also works)
&     and; puts two or more conditions together
|     or
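Here is a short sketch of how these operators can be combined. It uses only variables from the dataset above and the race codes 100 (White) and 200 (Black/Negro) that appear in the nolabel tabulation in section 8; the particular comparisons are just for illustration.

```stata
* People aged 20 to 29 in the 2000 survey. For whole-number ages this is
* the same condition as age>19 & age<30.
count if year==2000 & age>=20 & age<=29

* When you mix & and |, use parentheses to say which conditions go together.
summarize incwage if year==2000 & (race==100 | race==200)

* Not equal: everyone in the 2000 survey whose race code is not 100.
count if year==2000 & race~=100
```

Without the parentheses in the second command, STATA would evaluate the & before the |, and the year restriction would apply only to race==100.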
7) In order to generate a new variable, you would use generate. So
. generate perwt_rounded=round(perwt)
This operation creates a new variable, perwt_rounded, by rounding (to the nearest integer) an existing variable, perwt.
. replace perwt_rounded=0 if perwt<0
(51 real changes made)
This operation replaced all negative values of perwt_rounded with zero. Negative population weights don’t make sense in surveys.
Note that in generate and replace, where values are being assigned to variables, you use a SINGLE Equals sign, rather than a double.
There are several CPS educational variables. The one that is most consistent across survey years is educrec. None of the educational variables correspond to actual years of education; they are all categorical variables, so:
. tabulate educrec year
          Educational |
           attainment |                Survey year
               recode |      1962       1970       1980       1990 |     Total
----------------------+--------------------------------------------+----------
                  NIU |         0     40,544     43,935     36,714 |   201,448
    None or preschool |     1,167      1,030      1,114        755 |     5,073
 Grades 1, 2, 3, or 4 |     3,360      3,114      2,847      1,895 |    13,689
 Grades 5, 6, 7, or 8 |    19,806     22,742     18,145     11,385 |    86,482
              Grade 9 |     5,651      7,678      8,490      5,883 |    37,845
             Grade 10 |     5,984      8,742      9,807      6,787 |    42,882
             Grade 11 |     4,315      6,648      8,060      6,234 |    37,213
             Grade 12 |    19,743     33,438     48,926     44,509 |   227,306
1 to 3 years of colle |     6,475     11,621     21,259     21,967 |   128,211
  4+ years of college |     5,240      9,465     18,905     21,950 |   116,295
      Missing/Unknown |         0          1          0          0 |         1
----------------------+--------------------------------------------+----------
                Total |    71,741    145,023    181,488    158,079 |   896,445

          Educational |
           attainment |      Survey year
               recode |      2000       2008 |     Total
----------------------+----------------------+----------
                  NIU |    30,484     49,771 |   201,448
    None or preschool |       457        550 |     5,073
 Grades 1, 2, 3, or 4 |     1,187      1,286 |    13,689
 Grades 5, 6, 7, or 8 |     6,847      7,557 |    86,482
              Grade 9 |     4,161      5,982 |    37,845
             Grade 10 |     4,695      6,867 |    42,882
             Grade 11 |     4,721      7,235 |    37,213
             Grade 12 |    33,461     47,229 |   227,306
1 to 3 years of colle |    25,883     41,006 |   128,211
  4+ years of college |    21,814     38,921 |   116,295
      Missing/Unknown |         0          0 |         1
----------------------+----------------------+----------
                Total |   133,710    206,404 |   896,445
Note that educrec, the older educational variable that applies to all years, topcodes education at 4+ years of college. That’s not very useful if we want to look at people with advanced degrees.
. tabulate educrec, nolab
Educational |
 attainment |
     recode |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    201,448       22.47       22.47
          1 |      5,073        0.57       23.04
          2 |     13,689        1.53       24.56
          3 |     86,482        9.65       34.21
          4 |     37,845        4.22       38.43
          5 |     42,882        4.78       43.22
          6 |     37,213        4.15       47.37
          7 |    227,306       25.36       72.72
          8 |    128,211       14.30       87.03
          9 |    116,295       12.97      100.00
         99 |          1        0.00      100.00
------------+-----------------------------------
      Total |    896,445      100.00
When you tabulate without the labels, you see how the different categories are really stored.
. gen yrsed=. if educrec==0
(896445 missing values generated)
. replace yrsed=0 if educrec==1
(5073 real changes made)
. replace yrsed=2.5 if educrec==2
(13689 real changes made)
. replace yrsed=6.5 if educrec==3
(86482 real changes made)
. replace yrsed=9 if educrec==4
(37845 real changes made)
. replace yrsed=10 if educrec==5
(42882 real changes made)
. replace yrsed=11 if educrec==6
(37213 real changes made)
. replace yrsed=12 if educrec==7
(227306 real changes made)
. replace yrsed=14 if educrec==8
(128211 real changes made)
. replace yrsed=17 if educrec==9
(116295 real changes made)
. replace yrsed=. if educrec==99
(0 real changes made)
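The chain of replace commands above could also be written as a single recode command. This is a sketch of an alternative, not the method used in this guide; yrsed2 is a made-up variable name, chosen so that yrsed itself is left untouched, and the /// line continuations only work when the commands are run from a do-file.

```stata
* One recode in place of the gen/replace chain: each rule maps an educrec
* code to approximate years of education, with NIU (0) and
* Missing/Unknown (99) set to missing.
recode educrec (0 99 = .) (1 = 0) (2 = 2.5) (3 = 6.5) (4 = 9)   ///
               (5 = 10) (6 = 11) (7 = 12) (8 = 14) (9 = 17),    ///
               generate(yrsed2)

* Stop with an error if the two versions disagree on any observation.
assert yrsed2 == yrsed
```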
. table educrec, contents(mean yrsed)
-------------------------------------
Educational attainment  |
recode                  | mean(yrsed)
------------------------+------------
                    NIU |
      None or preschool |           0
   Grades 1, 2, 3, or 4 |         2.5
   Grades 5, 6, 7, or 8 |         6.5
                Grade 9 |           9
               Grade 10 |          10
               Grade 11 |          11
               Grade 12 |          12
1 to 3 years of college |          14
    4+ years of college |          17
        Missing/Unknown |
-------------------------------------
table is like tabulate, except table allows you to put means or other calculated values in the cells, while tabulate just wants counts (or weighted counts) and the associated row and column percentages in each cell.
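The contents() option shown above is the syntax of older versions of STATA. To the best of my knowledge, the table command was redesigned in version 17 and takes a statistic() option instead; tabstat is another route that works in old and new versions alike. Treat this as a sketch and check help table in your own version:

```stata
* Stata 17 and later: mean of yrsed within each education category.
table educrec, statistic(mean yrsed)

* Older and newer versions: the same means from tabstat.
tabstat yrsed, by(educrec) statistics(mean)
```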
This then allows me to look at educational attainment by race:
. sort race
. by race: summarize yrsed if year==2000
-> race = White
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |     88334    12.81067    3.154691          0         17
-> race = Black/Negro
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |      9916    12.32498    2.928274          0         17
-> race = American Indian/Aleut/Eskimo
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |      1320    11.92008    3.017615          0         17
-> race = Asian or Pacific Islander
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |      3656    13.39401    3.606886          0         17
Or even better, to look at educational attainment by race for people old enough to have gone through the educational system.
. by race: summarize yrsed if year==2000 & age>25
-> race = White
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |     71946    13.09026    3.179583          0         17
-> race = Black/Negro
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |      7768     12.5708    3.006368          0         17
-> race = American Indian/Aleut/Eskimo
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |       983    12.23398    3.108638          0         17
-> race = Asian or Pacific Islander
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       yrsed |      2850    13.66737    3.770145          0         17
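If you would rather see the groups side by side in one compact table, tabstat can produce the same statistics. A minimal sketch, assuming the yrsed variable created above:

```stata
* One row per race group: number of observations, mean, standard
* deviation, minimum and maximum of yrsed, for people over 25 in 2000.
tabstat yrsed if year==2000 & age>25, by(race) statistics(n mean sd min max)
```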
8) A Note about Numbers, Data, Categories, and Labels. If you Tabulate race, you get this:
. tabulate race if year==2000
                                 Race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                White |    113,475       84.87       84.87
                          Black/Negro |     13,626       10.19       95.06
         American Indian/Aleut/Eskimo |      1,894        1.42       96.47
            Asian or Pacific Islander |      4,715        3.53      100.00
--------------------------------------+-----------------------------------
                                Total |    133,710      100.00
. tabulate race if year==2000, nolabel
       Race |      Freq.     Percent        Cum.
------------+-----------------------------------
        100 |    113,475       84.87       84.87
        200 |     13,626       10.19       95.06
        300 |      1,894        1.42       96.47
        650 |      4,715        3.53      100.00
------------+-----------------------------------
      Total |    133,710      100.00
In other words, race is stored as a numerical variable, with 4 categories in the CPS 2000 (more categories in 2008). The racial groups don’t correspond to numbers in any meaningful way. The numbers are arbitrary. The categories are nominal, as opposed to ordinal (ordinal categories have a meaningful order). You could, if you wanted to, take the average of those numerical categories, but the results would be meaningless.
. summarize race if year==2000
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        race |    133710    132.4183    105.8387        100        650
One key mistake *NOT* to make is to treat nominal variables like race as if the numbers really meant something. Do numerical manipulation only on variables that have numerical values that really mean something, for instance the income variables. On the other side of things, tabulate is good for categorical variables like race, but mostly useless for continuous variables like income, because income has thousands of different levels, for every different recorded income in the dataset.
If you want to examine only Blacks, you need to use the condition “if race==200”. STATA doesn’t understand a condition like “if race=="Black"”, because race is stored as a number, not as text. For example:
. summarize inctot if race=="White"
type mismatch
r(109);
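Two ways around the type mismatch are sketched below. The first uses the numeric code from the nolabel tabulation above. The second asks STATA to look the code up in the value label; it assumes the label text is exactly "White" and that the value label is named racelbl, as in the describe output.

```stata
* Use the numeric code directly (100 is White in this dataset).
summarize inctot if race==100

* Or let STATA translate the label text into its numeric code.
summarize inctot if race=="White":racelbl
```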
So:
. tabulate region if year==2000
        Region and division |      Freq.     Percent        Cum.
----------------------------+-----------------------------------
   Middle Atlantic Division |     17,734       13.26       20.35
East North Central Division |     18,311       13.69       34.04
West North Central Division |     11,446        8.56       42.60
East South Central Division |      6,564        4.91       63.23
West South Central Division |     13,299        9.95       73.17
          Mountain Division |     16,382       12.25       85.42
           Pacific Division |     19,489       14.58      100.00
----------------------------+-----------------------------------
                      Total |    133,710      100.00
. tabulate region if year==2000 & race==200
        Region and division |      Freq.     Percent        Cum.
----------------------------+-----------------------------------
   Middle Atlantic Division |      2,308       16.94       20.44
East North Central Division |      2,239       16.43       36.87
West North Central Division |        420        3.08       39.95
East South Central Division |      1,370       10.05       80.60
West South Central Division |      1,597       11.72       92.32
          Mountain Division |        284        2.08       94.40
           Pacific Division |        763        5.60      100.00
----------------------------+-----------------------------------
                      Total |     13,626      100.00
Blacks are overrepresented in the