This is a brief guide to STATA for students in sociology 180

This is a brief guide to STATA for students in sociology 180 (updated 9/16/2013, © Michael J. Rosenfeld 2008 and 2013), based on the March 2000 Current Population Survey data that is available on my website. We will be using about 15 or 20 commands in STATA, and STATA has hundreds. Most of those commands do all sorts of sophisticated statistical stuff that you don't need to worry about. Stick to the basics:

Remember to start a log file before you do any important work; otherwise there will be no saved record of your commands and results.
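A minimal sketch of starting and closing a log, to be typed at the STATA command line or saved in a do-file. The file name soc180_session.log is only a placeholder, not something the course requires.

```stata
* Start a plain-text log; "soc180_session.log" is a hypothetical file name.
* The replace option overwrites an existing log with the same name.
log using "soc180_session.log", replace text

* ... your commands go here ...
describe

* Close the log when you are finished so the file is complete.
log close
```

Use the append option instead of replace if you would rather add to an earlier log than overwrite it.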

Remember that STATA has online help. Use it to get more advice about the syntax and options of the commands you will need.
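For instance, the help for two of the commands covered in this guide can be opened as follows (a small sketch; any command name can be used in place of these):

```stata
* Open the help file for a command.
help tabulate
help summarize

* Search the documentation by keyword when you do not know the command name.
search frequency weights
```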

/* If you have an older version of STATA, prior to version 12, you may need to bump up the memory allocated to STATA before you can use the dataset. From the STATA command line, type:

. set mem 200m

(this bumps the memory up to 200 MB, which gives us reasonable headroom considering that the CPS dataset is ~90 MB. If you are going to use the 2000 CPS alone, that dataset is about 10MB in size and you would want to set memory to at least 20MB)

In Stata 12 or 13, the program allocates memory on the fly, and you can see the allocation in the data tab of the main screen*/

Useful STATA commands:

1) describe. This tells you about your dataset. For instance:

. describe

Contains data from E:\AAA Miker Data folder\March CPS files for class\version 2 with occ90\Multiyear CPS.dta
  obs:       896,445
 vars:            51                          21 Jan 2009 11:16
 size:    92,333,835 (64.8% of memory free)
-------------------------------------------------------------------------------------------------------------------
                storage display     value
variable name   type    format      label         variable label
-------------------------------------------------------------------------------------------------------------------
year            int     %8.0g       yearlbl       Survey year
serial          long    %12.0g      seriallbl     Household serial number
hhwt            float   %9.0g       hhwtlbl       Household weight
region          byte    %27.0g      regionlbl     Region and division
statefip        byte    %57.0g      statefiplbl   State (FIPS code)
metro           byte    %27.0g      metrolbl      Metropolitan central city status
metarea         int     %50.0g      metarealbl    Metropolitan area
ownershp        byte    %21.0g      ownershplbl   Ownership of dwelling
hhincome        long    %12.0g      hhincomelbl   Total household income
pubhous         byte    %8.0g       pubhouslbl    Living in public housing
foodstmp        byte    %8.0g       foodstmplbl   Food stamp recipiency
pernum          byte    %8.0g       pernumlbl     Person number in sample unit
perwt           float   %9.0g       perwtlbl      Person weight
momloc          byte    %8.0g       momloclbl     Mother's location in the household
poploc          byte    %8.0g       poploclbl     Father's location in the household
sploc           byte    %8.0g       sploclbl      Spouse's location in household
famsize         byte    %25.0g      famsizelbl    Number of own family members in hh
nchild          byte    %18.0g      nchildlbl     Number of own children in household
nchlt5          byte    %23.0g      nchlt5lbl     Number of own children under age 5 in hh
nsibs           byte    %18.0g      nsibslbl      Number of own siblings in household
relate          int     %34.0g      relatelbl     Relationship to household head
age             byte    %19.0g      agelbl        Age
sex             byte    %8.0g       sexlbl        Sex
race            int     %37.0g      racelbl       Race
marst           byte    %23.0g      marstlbl      Marital status
popstat         byte    %14.0g      popstatlbl    Adult civilian, armed forces, or child
bpl             long    %27.0g      bpllbl        Birthplace
yrimmig         int     %11.0g      yrimmiglbl    Year of immigration
citizen         byte    %31.0g      citizenlbl    Citizenship status
mbpl            long    %27.0g      mbpllbl       Mother's birthplace
fbpl            long    %27.0g      fbpllbl       Father's birthplace
hispan          int     %29.0g      hispanlbl     Hispanic origin
educ99          byte    %38.0g      educ99lbl     Educational attainment, 1990
empstat         byte    %30.0g      empstatlbl    Employment status
occ1990         int     %78.0g      occ1990lbl    Occupation, 1990 basis
wkswork1        byte    %8.0g       wkswork1lbl   Weeks worked last year
hrswork         byte    %8.0g       hrsworklbl    Hours worked last week
uhrswork        byte    %13.0g      uhrsworklbl   Usual hours worked per week (last yr)
hourwage        int     %8.0g       hourwagelbl   Hourly wage
union           byte    %33.0g      unionlbl      Union membership
inctot          long    %12.0g      inctotlbl     Total personal income
incwage         long    %12.0g      incwagelbl    Wage and salary income
incss           long    %12.0g      incsslbl      Social Security income
incwelfr        long    %12.0g      incwelfrlbl   Welfare (public assistance) income
vetstat         byte    %10.0g      vetstatlbl    Veteran status
vetlast         byte    %26.0g      vetlastlbl    Veteran's most recent period of service
disabwrk        byte    %34.0g      disabwrklbl   Work disability
health          byte    %9.0g       healthlbl     Health status
inclugh         byte    %8.0g       inclughlbl    Included in employer group health plan last year
himcaid         byte    %8.0g       himcaidlbl    Covered by Medicaid last year
ftotval         double  %10.0g      ftotvallbl    Total family income
-------------------------------------------------------------------------------------------------------------------
Sorted by:

/* note that in Stata 13, the observations, variables, and size are part of the data tab, not reported in the output for describe*/

Things that this tells us: There are 896,445 ‘observations’ in this dataset. Each observation is one person, and the total is the full number of individuals in the March Current Population Survey for the survey years included (see the tabulation by year below). There are 51 ‘variables’ in the dataset, which is a small fraction of the number of variables available in the CPS. I will be adding a few variables to the dataset over time, so this number will grow. The size of the dataset is 92.3 million bytes, or something like 90 MB. Since STATA wants to have all the data in memory, versions of STATA before version 12 need more memory than that allocated (200 MB gives reasonable headroom) before you can load and work on the dataset.

Some more info: the variables are listed in the left hand column; this is the list of variables you will see in your ‘variable’ window in STATA. The next column, storage type, tells you how STATA stores the variable. Byte, int, and long are integer types that can hold progressively larger numbers; float and double can also hold numbers with decimals. You’ll notice, for instance, that the variable ‘race’ is stored as ‘int’, which means the values of the variable ‘race’ are integers. But why store race as 1, 2, 3, etc. instead of ‘Black’, ‘White’, etc.? Well, 1, 2, 3 takes up less space. So how do you know which number corresponds to which race? The best way is to attach labels to the values, which IPUMS has done for us. You can see which variables have value labels and which don’t in the description of the dataset above.
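One way to see how the numbers map to categories is sketched below. It assumes the CPS dataset from the describe output above is in memory; racelbl is the value label name listed for race in that output.

```stata
* Show the storage type and the attached value label for race.
describe race

* List every numeric code in the value label racelbl next to its text.
label list racelbl

* Print a compact summary of the variable (type, range, number of codes).
codebook race
```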

Every variable has a ‘variable description’. The variables and their descriptions are best located at the website www.ipums.org, where the data come from. Even more specifically, see

* ipums variable descriptions for CPS here: http://cps.ipums.org/cps-action/variableGroups.do

* and ipums introduction to the CPS methodology here: http://cps.ipums.org/cps/documentation.shtml

It’s important to look at the Data Dictionary because sometimes a value like -99 really means -99, and sometimes it means ‘missing value’. On variables like income there will be a ‘topcode’, which is the highest income that the Census Bureau will report in order to preserve confidentiality. That’s important stuff to know.
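A quick way to spot possible missing-value codes and topcodes is to look at the smallest and largest values of a variable in each survey year, and then compare them against the IPUMS documentation. This is only a sketch, and it assumes the CPS dataset is in memory:

```stata
* Smallest value, largest value, and number of nonmissing observations
* of wage income, separately for each survey year.
tabstat incwage, by(year) statistics(min max n)
```

An oddly round or repeated maximum is worth checking against the documentation, since it may be a topcode or a special code rather than a real dollar amount.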

2) Tabulate. Tabulate gives you the breakdown on categorical data, such as:

. tabulate year

Survey year |      Freq.     Percent        Cum.
------------+-----------------------------------
       1962 |     71,741        8.00        8.00
       1970 |    145,023       16.18       24.18
       1980 |    181,488       20.25       44.43
       1990 |    158,079       17.63       62.06
       2000 |    133,710       14.92       76.98
       2008 |    206,404       23.02      100.00
------------+-----------------------------------
      Total |    896,445      100.00

Here you see that 896,445 individual observations are spread across 6 different survey years. If you want to limit yourself to 1 survey year, you have to specify that one year, otherwise you get all 6 mixed together. The CPS survey goes out into the field every month, so there is March CPS data for every year in this span, but I have provided a subset of the years (including earliest and most recent) to keep the size of the dataset manageable.

. tabulate race year

                      |                           Survey year
                 Race |      1962       1970       1980       1990       2000       2008 |     Total
----------------------+------------------------------------------------------------------+----------
                White |    64,266    127,659    158,274    135,652    113,475    164,142 |   763,468
          Black/Negro |     6,849     16,038     17,711     16,036     13,626     23,864 |    94,124
American Indian/Aleut |         0          0          0      1,471      1,894      2,803 |     6,168
Asian or Pacific Isla |         0          0          0      4,362      4,715          0 |     9,077
           Asian only |         0          0          0          0          0      9,617 |     9,617
Hawaiian/Pacific Isla |         0          0          0          0          0        893 |       893
 Other (single) race, |       626      1,326      5,503        558          0          0 |     8,013
          White-Black |         0          0          0          0          0      1,003 |     1,003
White-American Indian |         0          0          0          0          0      1,907 |     1,907
          White-Asian |         0          0          0          0          0        893 |       893
White-Hawaiian/Pacifi |         0          0          0          0          0        253 |       253
Black-American Indian |         0          0          0          0          0        202 |       202
          Black-Asian |         0          0          0          0          0         57 |        57
Black-Hawaiian/Pacifi |         0          0          0          0          0          8 |         8
American Indian-Asian |         0          0          0          0          0         18 |        18
Asian-Hawaiian/Pacifi |         0          0          0          0          0        233 |       233
 White-Black-American |         0          0          0          0          0        173 |       173
    White-Black-Asian |         0          0          0          0          0          8 |         8
White-American Indian |         0          0          0          0          0         22 |        22
White-Asian-Hawaiian/ |         0          0          0          0          0        240 |       240
 White-Black-American |         0          0          0          0          0          4 |         4
Two or three races, u |         0          0          0          0          0         35 |        35
Four or five races, u |         0          0          0          0          0         29 |        29
----------------------+------------------------------------------------------------------+----------
                Total |    71,741    145,023    181,488    158,079    133,710    206,404 |   896,445

This is a cross-tabulation, a tabulation of two variables. The first thing you will notice is that the different survey years coded race differently. That is typical: surveys change and adapt. The ipums documentation will generally show you how the variables have changed over time. For instance, the 1962, 1970, and 1980 CPS did not have a separate category for Asians. For 2008, the CPS adopted the newer census rules which let people choose more than one racial category, so they had to categorize a whole bunch of new multiracial combinations, all of which together constitute a fairly small percentage of the population. Out of 133,710 persons in the March 2000 survey, 113,475 are White, and 13,626 are Black. What about the Hispanics? Well, because of the funny way that the Census Bureau categorizes things, ‘Hispanic’ is not a ‘race’, so that the ‘Hispanics’ are hidden in this table, mostly under the ‘White’ category. There’s a separate question about Hispanic ancestry, called hispan.

The CPS is a nationally representative survey, which means you are supposed to be able to say things about the US non-institutional population as a whole, not just the 133,710 people in the survey. How do you generalize to the whole US population? You use the weights provided by the CPS. See section 3 below, on weights.

Here is a cross tabulation of Hispanicity by race, using only data from 2000.

. tabulate hispan race if year==2000

                      |                    Race
      Hispanic origin |     White  Black/Neg   American   Asian or |     Total
----------------------+--------------------------------------------+----------
         Not Hispanic |    89,551     12,885      1,646      4,559 |   108,641
     Mexican American |     6,337         29         73          8 |     6,447
      Chicano/Chicana |       360          0         17          7 |       384
   Mexican (Mexicano) |     7,970         55        109         21 |     8,155
         Puerto Rican |     2,057        169         19         35 |     2,280
                Cuban |       905         34          0          4 |       943
        Other Spanish |     1,652        171         15         25 |     1,863
Central/South America |     3,206        238         12         31 |     3,487
          Do not know |       461          2          0          8 |       471
 N/A (and no response |       976         43          3         17 |     1,039
----------------------+--------------------------------------------+----------
                Total |   113,475     13,626      1,894      4,715 |   133,710

It's sometimes helpful in a cross tabulation to get the row and column percentages, so:

. tabulate hispan race if year==2000, row col

+-------------------+

| Key |

|-------------------|

| frequency |

| row percentage |

| column percentage |

+-------------------+

| Race

Hispanic origin | White Black/Neg American Asian or | Total

----------------------+--------------------------------------------+----------

Not Hispanic | 89,551 12,885 1,646 4,559 | 108,641

| 82.43 11.86 1.52 4.20 | 100.00

| 78.92 94.56 86.91 96.69 | 81.25

----------------------+--------------------------------------------+----------

Mexican American | 6,337 29 73 8 | 6,447

| 98.29 0.45 1.13 0.12 | 100.00

| 5.58 0.21 3.85 0.17 | 4.82

----------------------+--------------------------------------------+----------

Chicano/Chicana | 360 0 17 7 | 384

| 93.75 0.00 4.43 1.82 | 100.00

| 0.32 0.00 0.90 0.15 | 0.29

----------------------+--------------------------------------------+----------

Mexican (Mexicano) | 7,970 55 109 21 | 8,155

| 97.73 0.67 1.34 0.26 | 100.00

| 7.02 0.40 5.76 0.45 | 6.10

----------------------+--------------------------------------------+----------

Puerto Rican | 2,057 169 19 35 | 2,280

| 90.22 7.41 0.83 1.54 | 100.00

| 1.81 1.24 1.00 0.74 | 1.71

----------------------+--------------------------------------------+----------

Cuban | 905 34 0 4 | 943

| 95.97 3.61 0.00 0.42 | 100.00

| 0.80 0.25 0.00 0.08 | 0.71

----------------------+--------------------------------------------+----------

Other Spanish | 1,652 171 15 25 | 1,863

| 88.67 9.18 0.81 1.34 | 100.00

| 1.46 1.25 0.79 0.53 | 1.39

----------------------+--------------------------------------------+----------

Central/South America | 3,206 238 12 31 | 3,487

| 91.94 6.83 0.34 0.89 | 100.00

| 2.83 1.75 0.63 0.66 | 2.61

----------------------+--------------------------------------------+----------

Do not know | 461 2 0 8 | 471

| 97.88 0.42 0.00 1.70 | 100.00

| 0.41 0.01 0.00 0.17 | 0.35

----------------------+--------------------------------------------+----------

N/A (and no response | 976 43 3 17 | 1,039

| 93.94 4.14 0.29 1.64 | 100.00

| 0.86 0.32 0.16 0.36 | 0.78

----------------------+--------------------------------------------+----------

Total | 113,475 13,626 1,894 4,715 | 133,710

| 84.87 10.19 1.42 3.53 | 100.00

| 100.00 100.00 100.00 100.00 | 100.00

What do we learn from this? 78.92% of whites are not Hispanic (a column percentage). About 98% of Mexicans and 90% of Puerto Ricans classify themselves as white (row percentages).
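If the three numbers in every cell are hard to read, the percentages can be requested one kind at a time. A sketch, assuming the CPS dataset is in memory; nofreq is the tabulate option that suppresses the raw counts:

```stata
* Row percentages only: within each Hispanic-origin group, the share in each race.
tabulate hispan race if year==2000, row nofreq

* Column percentages only: within each race, the share in each Hispanic-origin group.
tabulate hispan race if year==2000, column nofreq
```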

3) The use of weights: Most commands in STATA can handle weights, such as the following command which gives the actual populations of the US, rather than the number of records in the dataset (as above).

. tabulate race if year==2000

                                 Race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                White |    113,475       84.87       84.87
                          Black/Negro |     13,626       10.19       95.06
         American Indian/Aleut/Eskimo |      1,894        1.42       96.47
            Asian or Pacific Islander |      4,715        3.53      100.00
--------------------------------------+-----------------------------------
                                Total |    133,710      100.00

. tabulate race if year==2000 [fweight= perwt_rounded]

                                 Race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                White |224,806,952       82.02       82.02
                          Black/Negro | 35,508,668       12.96       94.98
         American Indian/Aleut/Eskimo |  2,847,473        1.04       96.01
            Asian or Pacific Islander | 10,924,728        3.99      100.00
--------------------------------------+-----------------------------------
                                Total |274,087,821      100.00

What does this mean? The total US non-institutional population in March of 2000 was 274 million, according to the CPS. Take a look at the CPS documentation to see what ‘non-institutional’ means; some groups such as prisoners are not represented in the CPS. The [fweight= perwt_rounded] syntax tells STATA that there is a frequency weight in the variable perwt_rounded (a rounded version of perwt that is created in section 7 below). A frequency weight means that each individual in the survey represents a large number of individuals in the general population. The average weight is roughly 2,050 (274 million divided by 133,710). About one in 2,050 residents of the US was directly involved in the survey, and one must multiply the individual responses by their weights to get a picture of the whole US population. It’s very, very important to keep straight in your mind the difference between the weighted and the unweighted data. If you want to know how many people are in the survey, use the unweighted data (total 133,710). If you want to know how many people in each category there are in the US (total population 274 million), use the weighted data.

One more point. You’ll notice that in the second panel above, the weighted data, Blacks represent 12.96% of the US population. In the first panel above, the unweighted data, Blacks represent only 10.19% of the population. What accounts for the difference? Well, the weights in the survey are not uniform. Some groups, like Blacks, receive slightly higher weights in the survey because their response rates are slightly lower. The weight makes up for the fact that Blacks are slightly underrepresented in the survey (one might legitimately ask how the Census Bureau figures out that Blacks are underrepresented, and if so by how much, but that’s a complex question that we will not be examining in this class). The overall response rate for this CPS survey is more than 90% (according to Appendix G of the documentation). That’s a very high response rate that the government can achieve because they have a lot of money and resources to do these surveys. Very simply, not only do the weights give us total numbers that very conveniently add up to the whole US population, but the weighted numbers provide what researchers believe are better (less biased) estimates of the relative sizes of groups. In other words, 12.96% (from the weighted tabulation) is a better estimate for the Black proportion of the US than 10.19% (from the unweighted tabulation).
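The arithmetic behind this comparison can be checked with STATA's display command, which works as a calculator and needs no dataset. The numbers below are copied from the two tabulations above:

```stata
* Black share of the sample (unweighted) and of the population (weighted), in percent.
display 100 * 13626 / 133710
display 100 * 35508668 / 274087821

* Average weight for everyone, and for Blacks only:
* weighted total divided by the number of survey records.
display 274087821 / 133710
display 35508668 / 13626
```

The first two lines should reproduce the percentages for Blacks in the two tables, to more decimal places. The average weight for Blacks comes out larger than the overall average, which is why their weighted share is larger than their unweighted share.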

4) Summarize. For numerical variables where the numbers really mean something, such as earnings, the summarize command provides mean (i.e. numerical average), standard deviation, minimum and maximum values.

. summarize incwage if year==2000

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 103226 19462.59 28843.38 0 364302

What does this mean? Well, it means the average earned income for 1999 (the full year before the survey) was $19,463. This seems kind of low, doesn’t it? One of the most important things you have to do is interpret the output, and ask yourself whether the output makes sense. The number of observations here is 103,226, which is less than 133,710, the full number of individuals in the March 2000 CPS. But the survey includes lots of people who are too young or too old to work, not to speak of the adults who are unemployed, and so on! Perhaps the mean earnings appear to be low because we’ve included in our sample population some people whose earnings must be zero (notice that the minimum is 0). We will use the if qualifier to tell STATA that we only want to examine a particular sub-population (see below for further explanation). Also, when we get around to comparing wages and income over time, it is important to take inflation into account (otherwise the numbers won’t really make sense).
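When a mean looks suspicious, the detail option of summarize is a reasonable next step: it adds the median and other percentiles, so you can see how much of the distribution sits at or near zero. A sketch, assuming the CPS dataset is in memory:

```stata
* Percentiles, the median (the 50% row), and the smallest and largest values.
summarize incwage if year==2000, detail

* Count how many of the 2000 records have exactly zero wage income.
count if year==2000 & incwage==0
```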

5) Sorting. You can sort by variables (sex, race, veteran status etc) and then calculate earnings (the Sort and then By syntax works with most STATA commands)

. sort sex

. by sex: summarize incwage if year==2000

-> sex = Male

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 49353 25943.8 34862.55 0 364302

-> sex = Female

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 53873 13525.17 20172.65 0 333564

The average earnings of men in the sample are $25,944 per year compared to $13,525 for women, but remember we’re still including all sorts of people here whose earnings must be zero.
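The sort step and the by step can also be combined into a single line with bysort, which should give the same output as the two commands above (assuming the CPS dataset is in memory):

```stata
* bysort sorts the data by sex first and then runs the command for each group.
bysort sex: summarize incwage if year==2000
```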

6) Let's say you want to look at earnings, but only for people in their 20s. You would use the IF construction, as follows:

. by sex: summarize incwage if year==2000 & age>19 & age<30

-> sex = Male

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 8351 19628.57 19126.9 0 257525

-> sex = Female

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 8808 13355.3 16022.36 0 333564

And you could further limit it by looking only at people who had positive earnings, so

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0

-> sex = Male

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 7348 22307.87 18868.08 8 257525

-> sex = Female

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 6921 16996.6 16273.32 5 333564

But notice here that the low end of the income spectrum for both men and women still includes people with yearly(!) wages of less than $10 for 1999. Those folks will skew the comparison quite a bit.
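As an aside, the age condition can also be written with STATA's inrange() function, which some people find easier to read. Since age is recorded in whole years, this sketch should select the same people as age>19 & age<30:

```stata
* inrange(age, 20, 29) is true when age is between 20 and 29, inclusive.
bysort sex: summarize incwage if year==2000 & inrange(age, 20, 29) & incwage>0
```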

If you use the weight variable, you'll get a more accurate average, and you'll know how many people in the US actually fit your criteria. So:

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 [fweight= perwt_rounded]

-------------------------------------------------------------------------------------------------------------------

-> sex = Male

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 15802234 22712.7 19420.26 8 257525

-------------------------------------------------------------------------------------------------------------------

-> sex = Female

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 14773668 17329.45 16514.31 5 333564

What this means is that there are 15.8 million twentysomething men and 14.8 million twentysomething women with positive incomes in the US in 1999, and their average income was about $22.7K for the men and $17.3K for the women. Now positive earnings may still not be the right group to think about. A lot of people in their 20s aren’t working full time, but may still have a small income. So is there a better group to think about? Well, the CPS asks a variety of questions about labor force participation. One of these variables is weeks worked last year, or wkswork1. Let’s limit our analysis to people who worked at least 30 weeks in 1999.

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 & wkswork1>29 [fweight= perwt_rounded]

-------------------------------------------------------------------------------------------------------------------

-> sex = Male

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 13486090 25412.34 19336.23 8 257525

-------------------------------------------------------------------------------------------------------------------

-> sex = Female

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

incwage | 11744135 20436.34 16906.75 5 333564

Note that in the above command, I said “year==2000” with a double equal sign. Why? Well, STATA has some special syntax for stuff that goes after the IF.

What is the syntax to use after the IF?

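<     less than
>     greater than
<=    less than or equal to
>=    greater than or equal to
==    equals (that's right, in comparisons STATA uses the double equal sign, typed with no space in between, to mean 'equal')
~=    not equal (!= also works)
&     and: puts two or more conditions together, all of which must be true
|     or: at least one of the conditions must be true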
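A short sketch of how these operators combine, assuming the CPS dataset is in memory. The codes 100 (White) and 200 (Black) are the ones shown by tabulate race with the nolabel option later in this guide:

```stata
* Parentheses make the grouping explicit. Without them, & is evaluated
* before |, which is usually not what you intend.
count if year==2000 & (race==100 | race==200)

* STATA treats a missing value as larger than any number, so a condition
* like age>25 is also true when age is missing. Rule that out explicitly.
count if year==2000 & age>25 & !missing(age)
```

The first count should equal the sum of the White and Black counts for 2000 in the tabulations above (113,475 + 13,626 = 127,101).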

7) In order to generate a new variable, you would use generate. So

. generate perwt_rounded=round(perwt)

This operation creates a new variable, perwt_rounded by rounding (to the nearest integer) an existing variable, perwt.

. replace perwt_rounded=0 if perwt<0

(51 real changes made)

This operation replaced all negative values of perwt_rounded with zero. Negative population weights don’t make sense in surveys.

Note that in generate and replace, where values are being assigned to variables, you use a SINGLE Equals sign, rather than a double.
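generate can also create an indicator (0/1) variable from a condition. In this sketch the variable name worked30 is made up for illustration, and the CPS dataset is assumed to be in memory. Notice the single equals sign for the assignment:

```stata
* 1 if the person worked at least 30 weeks last year, 0 otherwise;
* left missing when wkswork1 itself is missing.
generate worked30 = (wkswork1 > 29) if !missing(wkswork1)

* See how the new indicator is distributed for men and women in 2000.
tabulate worked30 sex if year==2000
```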

There are several CPS educational variables. The one that is most consistent across survey years is educrec. None of the educational variables corresponds to actual years of education; they are all categorical variables, so:

. tabulate educrec year

Educational | Survey year

attainment recode | 1962 1970 1980 1990 | Total

----------------------+--------------------------------------------+----------

NIU | 0 40,544 43,935 36,714 | 201,448

None or preschool | 1,167 1,030 1,114 755 | 5,073

Grades 1, 2, 3, or 4 | 3,360 3,114 2,847 1,895 | 13,689

Grades 5, 6, 7, or 8 | 19,806 22,742 18,145 11,385 | 86,482

Grade 9 | 5,651 7,678 8,490 5,883 | 37,845

Grade 10 | 5,984 8,742 9,807 6,787 | 42,882

Grade 11 | 4,315 6,648 8,060 6,234 | 37,213

Grade 12 | 19,743 33,438 48,926 44,509 | 227,306

1 to 3 years of colle | 6,475 11,621 21,259 21,967 | 128,211

4+ years of college | 5,240 9,465 18,905 21,950 | 116,295

Missing/Unknown | 0 1 0 0 | 1

----------------------+--------------------------------------------+----------

Total | 71,741 145,023 181,488 158,079 | 896,445

Educational | Survey year

attainment recode | 2000 2008 | Total

----------------------+----------------------+----------

NIU | 30,484 49,771 | 201,448

None or preschool | 457 550 | 5,073

Grades 1, 2, 3, or 4 | 1,187 1,286 | 13,689

Grades 5, 6, 7, or 8 | 6,847 7,557 | 86,482

Grade 9 | 4,161 5,982 | 37,845

Grade 10 | 4,695 6,867 | 42,882

Grade 11 | 4,721 7,235 | 37,213

Grade 12 | 33,461 47,229 | 227,306

1 to 3 years of colle | 25,883 41,006 | 128,211

4+ years of college | 21,814 38,921 | 116,295

Missing/Unknown | 0 0 | 1

----------------------+----------------------+----------

Total | 133,710 206,404 | 896,445

Note that educrec, the older educational variable that applies to all years, topcodes education at 4+ years of college. That’s not very useful if we want to look at people with advanced degrees.

. tabulate educrec, nolab

Educational |

attainment |

recode | Freq. Percent Cum.

------------+-----------------------------------

0 | 201,448 22.47 22.47

1 | 5,073 0.57 23.04

2 | 13,689 1.53 24.56

3 | 86,482 9.65 34.21

4 | 37,845 4.22 38.43

5 | 42,882 4.78 43.22

6 | 37,213 4.15 47.37

7 | 227,306 25.36 72.72

8 | 128,211 14.30 87.03

9 | 116,295 12.97 100.00

99 | 1 0.00 100.00

------------+-----------------------------------

Total | 896,445 100.00

When you tabulate without the labels, you see how the different categories are really stored.

. gen yrsed=. if educrec==0

(896445 missing values generated)

. replace yrsed=0 if educrec==1

(5073 real changes made)

. replace yrsed=2.5 if educrec==2

(13689 real changes made)

. replace yrsed=6.5 if educrec==3

(86482 real changes made)

. replace yrsed=9 if educrec==4

(37845 real changes made)

. replace yrsed=10 if educrec==5

(42882 real changes made)

. replace yrsed=11 if educrec==6

(37213 real changes made)

. replace yrsed=12 if educrec==7

(227306 real changes made)

. replace yrsed=14 if educrec==8

(128211 real changes made)

. replace yrsed=17 if educrec==9

(116295 real changes made)

. replace yrsed=. if educrec==99

(0 real changes made)

. table educrec, contents(mean yrsed)

-------------------------------------

Educational attainment |

recode | mean(yrsed)

------------------------+------------

NIU |

None or preschool | 0

Grades 1, 2, 3, or 4 | 2.5

Grades 5, 6, 7, or 8 | 6.5

Grade 9 | 9

Grade 10 | 10

Grade 11 | 11

Grade 12 | 12

1 to 3 years of college | 14

4+ years of college | 17

Missing/Unknown |

-------------------------------------

table is like tabulate, except table allows you to put means or other calculated values in the cells, while tabulate just wants counts (or weighted counts) and the associated row and column percentages in each cell.
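The chain of replace commands above can also be written as a single recode command. This is a sketch of an equivalent approach, not the method used above; yrsed2 is a made-up name so that the original yrsed is left untouched, and the CPS dataset is assumed to be in memory:

```stata
* Map each educrec code to approximate years of schooling in one step.
* Codes 0 (NIU) and 99 (Missing/Unknown) become missing.
recode educrec (0 99 = .) (1 = 0) (2 = 2.5) (3 = 6.5) (4 = 9) (5 = 10) (6 = 11) (7 = 12) (8 = 14) (9 = 17), generate(yrsed2)

* Check the mapping: one row per educrec category.
tabstat yrsed2, by(educrec) statistics(mean min max)
```

tabstat is used for the check here because, as far as I know, the syntax of table was redesigned in Stata 17, so the contents() option may behave differently in recent versions (see help table).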

This then allows me to look at educational attainment by race:

. sort race

. by race: summarize yrsed if year==2000

-> race = White

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 88334 12.81067 3.154691 0 17

-> race = Black/Negro

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 9916 12.32498 2.928274 0 17

-> race = American Indian/Aleut/Eskimo

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 1320 11.92008 3.017615 0 17

-> race = Asian or Pacific Islander

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 3656 13.39401 3.606886 0 17

Or even better, we can look at educational attainment by race for people old enough to have gone through the educational system (here, people older than 25):

. by race: summarize yrsed if year==2000 & age>25

-> race = White

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 71946 13.09026 3.179583 0 17

-> race = Black/Negro

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 7768 12.5708 3.006368 0 17

-> race = American Indian/Aleut/Eskimo

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 983 12.23398 3.108638 0 17

-> race = Asian or Pacific Islander

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

yrsed | 2850 13.66737 3.770145 0 17
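The means above are unweighted. A sketch of the weighted version, using the same frequency weight syntax as in section 3 and assuming that yrsed and perwt_rounded have been created as shown in this guide:

```stata
* Weighted means: each record counts perwt_rounded times, so the results
* describe the US population rather than the survey sample.
bysort race: summarize yrsed if year==2000 & age>25 [fweight=perwt_rounded]
```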

8) A Note about Numbers, Data, Categories, and Labels. If you Tabulate race, you get this:

. tabulate race if year==2000

Race | Freq. Percent Cum.

--------------------------------------+-----------------------------------

White | 113,475 84.87 84.87

Black/Negro | 13,626 10.19 95.06

American Indian/Aleut/Eskimo | 1,894 1.42 96.47

Asian or Pacific Islander | 4,715 3.53 100.00

--------------------------------------+-----------------------------------

Total | 133,710 100.00

. tabulate race if year==2000, nolabel

Race | Freq. Percent Cum.

------------+-----------------------------------

100 | 113,475 84.87 84.87

200 | 13,626 10.19 95.06

300 | 1,894 1.42 96.47

650 | 4,715 3.53 100.00

------------+-----------------------------------

Total | 133,710 100.00

In other words, race is stored as a numerical variable, with 4 categories in the CPS 2000 (more categories in 2008). The racial groups don’t correspond to numbers in any meaningful way. The numbers are arbitrary. The categories are nominal, as opposed to ordinal (ordinal categories have a meaningful order). You could, if you wanted to, take the average of those numerical categories, but the results would be meaningless.

. summarize race if year==2000

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

race | 133710 132.4183 105.8387 100 650

One key mistake *NOT* to make is to treat nominal variables like race as if the numbers really meant something. Do numerical manipulation only on variables that have numerical values that really mean something, for instance the income variables. On the other side of things, tabulate is good for categorical variables like race, but mostly useless for continuous variables like income, because income has thousands of different levels, one for every different recorded income in the dataset.
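If you do want a table for a continuous variable, one option is to collapse it into a few bands first. In this sketch the variable name incband, the label name incbandlbl, and the 25000 cut point are all arbitrary choices for illustration, and the CPS dataset is assumed to be in memory:

```stata
* Collapse wage income into three bands; missing incwage stays missing.
generate incband = 1 if incwage == 0
replace incband = 2 if incwage > 0 & incwage < 25000
replace incband = 3 if incwage >= 25000 & !missing(incwage)

* Label the bands so the table is readable.
label define incbandlbl 1 "No wage income" 2 "Under 25,000" 3 "25,000 or more"
label values incband incbandlbl

tabulate incband if year==2000
```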

If you want to examine only the blacks, you need to use the statement “if race==200”. Stata doesn’t understand a comparison with text such as “if race=="Black"”, because race is stored as a number.

. summarize inctot if race=="White"

type mismatch

r(109);
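If you would rather not memorize the codes, reasonably recent versions of STATA can look the code up from the label for you. A sketch, assuming the value label attached to race is named racelbl (as in the describe output above) and that the label text is typed exactly as it is stored:

```stata
* Use the numeric code directly...
summarize inctot if race==200 & year==2000

* ...or let STATA translate the label text into its code.
summarize inctot if race=="Black/Negro":racelbl & year==2000
```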

So:

. tabulate region if year==2000

Region and division | Freq. Percent Cum.

----------------------------+-----------------------------------

New England Division | 9,470 7.08 7.08

Middle Atlantic Division | 17,734 13.26 20.35

East North Central Division | 18,311 13.69 34.04

West North Central Division | 11,446 8.56 42.60

South Atlantic Division | 21,015 15.72 58.32

East South Central Division | 6,564 4.91 63.23

West South Central Division | 13,299 9.95 73.17

Mountain Division | 16,382 12.25 85.42

Pacific Division | 19,489 14.58 100.00

----------------------------+-----------------------------------

Total | 133,710 100.00

. tabulate region if year==2000 & race==200

Region and division | Freq. Percent Cum.

----------------------------+-----------------------------------

New England Division | 477 3.50 3.50

Middle Atlantic Division | 2,308 16.94 20.44

East North Central Division | 2,239 16.43 36.87

West North Central Division | 420 3.08 39.95

South Atlantic Division | 4,168 30.59 70.54

East South Central Division | 1,370 10.05 80.60

West South Central Division | 1,597 11.72 92.32

Mountain Division | 284 2.08 94.40

Pacific Division | 763 5.60 100.00

----------------------------+-----------------------------------

Total | 13,626 100.00

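Blacks are overrepresented in the South Atlantic (30.59% of Blacks in the sample live there, compared to 15.72% of the full sample), and underrepresented in the Pacific region of the US (5.60% compared to 14.58%).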
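The same comparison can be made in one table by cross-tabulating region by race and asking for column percentages. A sketch, assuming the CPS dataset is in memory and perwt_rounded has been created as in section 7:

```stata
* Column percentages: for each race, the share living in each region.
tabulate region race if year==2000, column nofreq

* The same table weighted to the US population.
tabulate region race if year==2000 [fweight=perwt_rounded], column nofreq
```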