This is a brief guide to STATA for students in sociology 180 (updated 9/16/2013, © Michael J. Rosenfeld 2008 and 2013), based on the March 2000 Current Population Survey data that is available on my website.  We will be using about 15 or 20 commands in STATA, and STATA has hundreds.  Most of those commands do all sorts of sophisticated statistical stuff that you don't need to worry about.  Stick to the basics:

 

Remember to start a log file before you do any important work; otherwise your work won't be saved.
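
A minimal sketch of starting and closing a log. The file name soc180_session.log is just a made-up placeholder, so substitute your own:

. log using soc180_session.log, replace text

. log close

The replace option overwrites an existing log of the same name (use append instead to add to the end of it), and text saves the log as plain text that any editor can open.  Run log close when you are done.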

 

Remember that STATA has online help.  Use it to get more detail about the commands you need and their options.
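
For example, to read the help file for tabulate from the command line:

. help tabulate

If you don't know the name of the command you need, search followed by a keyword (for example, search weights) lists the help entries related to that keyword.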

 

/* If you have an older version of STATA, prior to version 12, you may need to bump up the memory allocated to STATA before you can use the dataset.  From the STATA command line, type:

 

. set mem 200m

 

(this bumps the memory up to 200 MB, which gives us reasonable headroom considering that the CPS dataset is ~90 MB. If you are going to use the 2000 CPS alone, that dataset is about 10MB in size and you would want to set memory to at least 20MB)

In Stata 12 or 13, the program allocates memory on the fly, and you can see the allocation in the data tab of the main screen*/

 

Useful STATA commands:

 

1) describe.  This tells you about your dataset.  For instance:

 

. describe

 

Contains data from E:\AAA Miker Data folder\March CPS files for class\version 2 with occ90\Multiyear CPS.dta

  obs:       896,445                          

 vars:            51                          21 Jan 2009 11:16

 size:    92,333,835 (64.8% of memory free)

-------------------------------------------------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

-------------------------------------------------------------------------------------------------------------------

year            int    %8.0g       yearlbl    Survey year

serial          long   %12.0g      seriallbl  Household serial number

hhwt            float  %9.0g       hhwtlbl    Household weight

region          byte   %27.0g      regionlbl  Region and division

statefip        byte   %57.0g      statefiplbl State (FIPS code)

metro           byte   %27.0g      metrolbl   Metropolitan central city status

metarea         int    %50.0g      metarealbl Metropolitan area

ownershp        byte   %21.0g      ownershplbl Ownership of dwelling

hhincome        long   %12.0g      hhincomelbl Total household income

pubhous         byte   %8.0g       pubhouslbl Living in public housing

foodstmp        byte   %8.0g       foodstmplbl Food stamp recipiency

pernum          byte   %8.0g       pernumlbl Person number in sample unit

perwt           float  %9.0g       perwtlbl   Person weight

momloc          byte   %8.0g       momloclbl Mother's location in the household

poploc          byte   %8.0g       poploclbl Father's location in the household

sploc           byte   %8.0g       sploclbl   Spouse's location in household

famsize         byte   %25.0g      famsizelbl

                                              Number of own family members in hh

nchild          byte   %18.0g      nchildlbl

                                              Number of own children in household

nchlt5          byte   %23.0g      nchlt5lbl

                                              Number of own children under age 5 in hh

nsibs           byte   %18.0g      nsibslbl   Number of own siblings in household

relate          int    %34.0g      relatelbl

                                              Relationship to household head

age             byte   %19.0g      agelbl     Age

sex             byte   %8.0g       sexlbl     Sex

race            int    %37.0g      racelbl    Race

marst           byte   %23.0g      marstlbl   Marital status

popstat         byte   %14.0g      popstatlbl

                                              Adult civilian, armed forces, or child

bpl             long   %27.0g      bpllbl     Birthplace

yrimmig         int    %11.0g      yrimmiglbl

                                              Year of immigration

citizen         byte   %31.0g      citizenlbl

                                              Citizenship status

mbpl            long   %27.0g      mbpllbl    Mother's birthplace

fbpl            long   %27.0g      fbpllbl    Father's birthplace

hispan          int    %29.0g      hispanlbl

                                              Hispanic origin

educ99          byte   %38.0g      educ99lbl

                                              Educational attainment, 1990

empstat         byte   %30.0g      empstatlbl

                                              Employment status

occ1990         int    %78.0g      occ1990lbl

                                              Occupation, 1990 basis

wkswork1        byte   %8.0g       wkswork1lbl

                                              Weeks worked last year

hrswork         byte   %8.0g       hrsworklbl

                                              Hours worked last week

uhrswork        byte   %13.0g      uhrsworklbl

                                              Usual hours worked per week (last yr)

hourwage        int    %8.0g       hourwagelbl

                                              Hourly wage

union           byte   %33.0g      unionlbl   Union membership

inctot          long   %12.0g      inctotlbl

                                              Total personal income

incwage         long   %12.0g      incwagelbl

                                              Wage and salary income

incss           long   %12.0g      incsslbl   Social Security income

incwelfr        long   %12.0g      incwelfrlbl

                                              Welfare (public assistance) income

vetstat         byte   %10.0g      vetstatlbl

                                              Veteran status

vetlast         byte   %26.0g      vetlastlbl

                                              Veteran's most recent period of service

disabwrk        byte   %34.0g      disabwrklbl

                                              Work disability

health          byte   %9.0g       healthlbl

                                              Health status

inclugh         byte   %8.0g       inclughlbl

                                              Included in employer group health plan last year

himcaid         byte   %8.0g       himcaidlbl

                                              Covered by Medicaid last year

ftotval         double %10.0g      ftotvallbl

                                              Total family income

-------------------------------------------------------------------------------------------------------------------

Sorted by: 

 

/* note that in Stata 13, the observations, variables, and size are part of the data tab, not reported in the output for describe*/

 

Things that this tells us:  There are 896,445 ‘observations’ in this dataset.  Each observation is one person, and together they are all the individuals in the March Current Population Survey for the survey years in this file (see the tabulation of year below).  There are 51 ‘variables’ in the dataset, which is a small fraction of the number of variables available in the CPS. I will be adding a few variables to the dataset over time, so this number will grow. The size of the dataset is 92.3 million bytes, or something like 90 MB.  Since STATA wants to have all the data in memory, in versions before 12 you need to allocate more memory than the dataset takes up (200 MB leaves reasonable headroom for the full dataset; 20 MB is enough for the 2000 CPS alone) before you can load and work on the dataset.

 

Some more info:  the variables are listed in the left hand column; this is the list of variables you will see in your ‘variable’ window in STATA.  The next column, storage type, tells you how STATA stores the variable.  Byte, int, and long store integers, from smallest range to largest; float and double store numbers with decimals.  You’ll notice, for instance, that the variable ‘race’ is stored as ‘int’, which means the values of the variable ‘race’ are integers.  But why store race as 1, 2, 3, etc. instead of ‘Black’, ‘White’, etc.?  Well, 1, 2, 3 takes up less space.  So how do you know which number corresponds to which race?  The best way is to attach labels to the values, which ipums has done for us.  You can see which variables have value labels and which don’t in the description of the dataset above.
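
To see which number goes with which label, you can list the value label or tabulate without labels.  A sketch, using the label name racelbl from the describe output above:

. label list racelbl

. tabulate race if year==2000, nolabel

The first command prints every code in racelbl next to its label; the second gives the same breakdown as an ordinary tabulate, but shows the underlying numeric codes instead of the labels.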

 

Every variable has a ‘variable description’.  The variables and their descriptions are best located at the website www.ipums.org, where the data come from. Even more specifically, see

* ipums variable descriptions for CPS here: http://cps.ipums.org/cps-action/variableGroups.do

* and ipums introduction to the CPS methodology here: http://cps.ipums.org/cps/documentation.shtml

 

 

 It’s important to look at the Data Dictionary because sometimes a value like -99 really means -99, and sometimes it means ‘missing value’.  On variables like income there will be a ‘topcode’, which is the highest income that the Census Bureau will report in order to preserve confidentiality.  That’s important stuff to know.
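
One way to look for special codes and topcodes yourself, sketched here for incwage in the 2000 survey (which codes actually mean ‘missing’ is something only the ipums documentation can tell you for sure):

. summarize incwage if year==2000, detail

. label list incwagelbl

The detail option adds percentiles and the smallest and largest values to the usual summary, which makes a topcode or an odd value like -99 easier to spot.  label list prints whatever value labels are attached; incwagelbl is the label name shown for incwage in the describe output above.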

 

 

 

2) Tabulate.  Tabulate gives you the breakdown on categorical data, such as:

 

. tabulate year

 

Survey year |      Freq.     Percent        Cum.

------------+-----------------------------------

       1962 |     71,741        8.00        8.00

       1970 |    145,023       16.18       24.18

       1980 |    181,488       20.25       44.43

       1990 |    158,079       17.63       62.06

       2000 |    133,710       14.92       76.98

       2008 |    206,404       23.02      100.00

------------+-----------------------------------

      Total |    896,445      100.00

 

Here you see that 896,445 individual observations are spread across 6 different survey years. If you want to limit yourself to 1 survey year, you have to specify that one year, otherwise you get all 6 mixed together. The CPS survey goes out into the field every month, so there is March CPS data for every year in this span, but I have provided a subset of the years (including earliest and most recent) to keep the size of the dataset manageable.
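
To check how many records a one-year restriction leaves you with, count is handy:

. count if year==2000

Going by the tabulation above, this should report 133,710.  If you only ever want one year, keep if year==2000 drops the other years from memory (just don't save over the original file afterwards).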

 

 

. tabulate race year

 

                      |                            Survey year

                 Race |      1962       1970       1980       1990       2000       2008 |     Total

----------------------+------------------------------------------------------------------+----------

                White |    64,266    127,659    158,274    135,652    113,475    164,142 |   763,468

          Black/Negro |     6,849     16,038     17,711     16,036     13,626     23,864 |    94,124

American Indian/Aleut |         0          0          0      1,471      1,894      2,803 |     6,168

Asian or Pacific Isla |         0          0          0      4,362      4,715          0 |     9,077

           Asian only |         0          0          0          0          0      9,617 |     9,617

Hawaiian/Pacific Isla |         0          0          0          0          0        893 |       893

Other (single) race,  |       626      1,326      5,503        558          0          0 |     8,013

          White-Black |         0          0          0          0          0      1,003 |     1,003

White-American Indian |         0          0          0          0          0      1,907 |     1,907

          White-Asian |         0          0          0          0          0        893 |       893

White-Hawaiian/Pacifi |         0          0          0          0          0        253 |       253

Black-American Indian |         0          0          0          0          0        202 |       202

          Black-Asian |         0          0          0          0          0         57 |        57

Black-Hawaiian/Pacifi |         0          0          0          0          0          8 |         8

American Indian-Asian |         0          0          0          0          0         18 |        18

Asian-Hawaiian/Pacifi |         0          0          0          0          0        233 |       233

White-Black-American  |         0          0          0          0          0        173 |       173

    White-Black-Asian |         0          0          0          0          0          8 |         8

White-American Indian |         0          0          0          0          0         22 |        22

White-Asian-Hawaiian/ |         0          0          0          0          0        240 |       240

White-Black-American  |         0          0          0          0          0          4 |         4

Two or three races, u |         0          0          0          0          0         35 |        35

Four or five races, u |         0          0          0          0          0         29 |        29

----------------------+------------------------------------------------------------------+----------

                Total |    71,741    145,023    181,488    158,079    133,710    206,404 |   896,445

 

 

 

This is a cross-tabulation, a tabulation of two variables. The first thing you will notice is that the different survey years coded race differently. That is typical: surveys change and adapt. The ipums documentation will generally show you how the variables have changed over time. For instance, the 1962, 1970, and 1980 CPS did not have a separate category for Asians. For 2008, the CPS adopted the newer census rules which let people choose more than one racial category, so they had to categorize a whole bunch of new multiracial combinations, all of which together constitute a fairly small percentage of the population.  Out of 133,710 persons in the March 2000 survey, 113,475 are White, and 13,626 are Black.  What about the Hispanics?  Well, because of the funny way that the Census Bureau categorizes things, ‘Hispanic’ is not a ‘race’, so that the ‘Hispanics’ are hidden in this table, mostly under the ‘White’ category.  There’s a separate question about Hispanic ancestry, called hispan.

 

The CPS is a nationally representative survey, which means you are supposed to be able to say things about the US non-institutional population as a whole, not just the 133,710 people in the survey.  How do you generalize to the whole US population?  You use the weights provided by the CPS.  See section 3 below, on weights.

 

Here is a cross tabulation of Hispanicity by race, using only data from 2000.

 

 

. tabulate hispan race if year==2000

 

                      |                    Race

      Hispanic origin |     White  Black/Neg  American   Asian or  |     Total

----------------------+--------------------------------------------+----------

         Not Hispanic |    89,551     12,885      1,646      4,559 |   108,641

     Mexican American |     6,337         29         73          8 |     6,447

      Chicano/Chicana |       360          0         17          7 |       384

   Mexican (Mexicano) |     7,970         55        109         21 |     8,155

         Puerto Rican |     2,057        169         19         35 |     2,280

                Cuban |       905         34          0          4 |       943

        Other Spanish |     1,652        171         15         25 |     1,863

Central/South America |     3,206        238         12         31 |     3,487

          Do not know |       461          2          0          8 |       471

N/A (and no response  |       976         43          3         17 |     1,039

----------------------+--------------------------------------------+----------

                Total |   113,475     13,626      1,894      4,715 |   133,710

 

 

It's sometimes helpful in a cross tabulation to get the row and column percentages, so:

 

 

 

 

. tabulate hispan race if year==2000, row col

 

+-------------------+

| Key               |

|-------------------|

|     frequency     |

|  row percentage   |

| column percentage |

+-------------------+

 

                      |                    Race

      Hispanic origin |     White  Black/Neg  American   Asian or  |     Total

----------------------+--------------------------------------------+----------

         Not Hispanic |    89,551     12,885      1,646      4,559 |   108,641

                      |     82.43      11.86       1.52       4.20 |    100.00

                      |     78.92      94.56      86.91      96.69 |     81.25

----------------------+--------------------------------------------+----------

     Mexican American |     6,337         29         73          8 |     6,447

                      |     98.29       0.45       1.13       0.12 |    100.00

                      |      5.58       0.21       3.85       0.17 |      4.82

----------------------+--------------------------------------------+----------

      Chicano/Chicana |       360          0         17          7 |       384

                      |     93.75       0.00       4.43       1.82 |    100.00

                      |      0.32       0.00       0.90       0.15 |      0.29

----------------------+--------------------------------------------+----------

   Mexican (Mexicano) |     7,970         55        109         21 |     8,155

                      |     97.73       0.67       1.34       0.26 |    100.00

                      |      7.02       0.40       5.76       0.45 |      6.10

----------------------+--------------------------------------------+----------

         Puerto Rican |     2,057        169         19         35 |     2,280

                      |     90.22       7.41       0.83       1.54 |    100.00

                      |      1.81       1.24       1.00       0.74 |      1.71

----------------------+--------------------------------------------+----------

                Cuban |       905         34          0          4 |       943

                      |     95.97       3.61       0.00       0.42 |    100.00

                      |      0.80       0.25       0.00       0.08 |      0.71

----------------------+--------------------------------------------+----------

        Other Spanish |     1,652        171         15         25 |     1,863

                      |     88.67       9.18       0.81       1.34 |    100.00

                      |      1.46       1.25       0.79       0.53 |      1.39

----------------------+--------------------------------------------+----------

Central/South America |     3,206        238         12         31 |     3,487

                      |     91.94       6.83       0.34       0.89 |    100.00

                      |      2.83       1.75       0.63       0.66 |      2.61

----------------------+--------------------------------------------+----------

          Do not know |       461          2          0          8 |       471

                      |     97.88       0.42       0.00       1.70 |    100.00

                      |      0.41       0.01       0.00       0.17 |      0.35

----------------------+--------------------------------------------+----------

N/A (and no response  |       976         43          3         17 |     1,039

                      |     93.94       4.14       0.29       1.64 |    100.00

                      |      0.86       0.32       0.16       0.36 |      0.78

----------------------+--------------------------------------------+----------

                Total |   113,475     13,626      1,894      4,715 |   133,710

                      |     84.87      10.19       1.42       3.53 |    100.00

                      |    100.00     100.00     100.00     100.00 |    100.00

 

What do we learn from this? Reading the column percentages, 78.92% of whites are not Hispanic. Reading the row percentages, about 98% of Mexicans and 90% of Puerto Ricans classify themselves as white.
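
If three numbers per cell are too much to read, you can ask for just one kind of percentage and suppress the counts.  A sketch using the same variables:

. tabulate hispan race if year==2000, row nofreq

With row and nofreq, each cell shows only the row percentage, so each Hispanic-origin row adds up to 100 across the race columns.  Swap row for col to get column percentages instead.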

 

 

3) The use of weights:  Most commands in STATA can handle weights.  Compare the unweighted tabulation below with the weighted one that follows it, which gives the actual populations of the US rather than the number of records in the dataset.

 

 

. tabulate race if year==2000

 

                                 Race |      Freq.     Percent        Cum.

--------------------------------------+-----------------------------------

                                White |    113,475       84.87       84.87

                          Black/Negro |     13,626       10.19       95.06

         American Indian/Aleut/Eskimo |      1,894        1.42       96.47

            Asian or Pacific Islander |      4,715        3.53      100.00

--------------------------------------+-----------------------------------

                                Total |    133,710      100.00

 

 

 

. tabulate race if year==2000 [fweight= perwt_rounded]

 

                                 Race |      Freq.     Percent        Cum.

--------------------------------------+-----------------------------------

                                White |224,806,952       82.02       82.02

                          Black/Negro | 35,508,668       12.96       94.98

         American Indian/Aleut/Eskimo |  2,847,473        1.04       96.01

            Asian or Pacific Islander | 10,924,728        3.99      100.00

--------------------------------------+-----------------------------------

                                Total |274,087,821      100.00

 

What does this mean?  The total US non-institutional population in March of 2000 was 274 million, according to the CPS.  Take a look at the CPS documentation to see what ‘non-institutional’ means; some groups such as prisoners are not represented in the CPS.  The [fweight= perwt_rounded] syntax tells STATA that there is a frequency weight in the variable perwt_rounded (section 7 below shows how perwt_rounded was created from perwt).  A frequency weight means that each individual in the survey represents a large number of individuals in the general population.  The average weight is roughly 2,050 (274 million divided by 133,710).  About one in 2,050 residents of the US was directly involved in the survey, and one must multiply each individual response by its weight to get a picture of the whole US population.  It’s very, very important to keep straight in your mind the difference between the weighted and the unweighted data.  If you want to know how many people are in the survey, use the unweighted data (total population 133,710).  If you want to know how many people in each category there are in the US (total population 274 million), use the weighted data.
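
You can check the size of the average weight yourself.  A small sketch, using the two totals from the tabulations above:

. display 274087821/133710

. summarize perwt_rounded if year==2000

The display line is plain arithmetic (weighted total divided by number of records) and comes to about 2,050.  The mean reported by summarize should be close to the same number, since the weights are what add up to the weighted total.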

 

One more point.  You’ll notice that in the second panel above, the weighted data, Blacks represent 12.96% of the US population.  In the first panel above, the unweighted data, Blacks represent only 10.19% of the population.  What accounts for the difference?  Well, the weights in the survey are not uniform.  Some groups, like Blacks, receive slightly higher weights in the survey because their response rates are slightly lower.  The weight makes up for the fact that Blacks are slightly underrepresented in the survey (one might legitimately ask how the Census Bureau figures out that Blacks are underrepresented, and if so by how much, but that’s a complex question that we will not be examining in this class).  The overall response rate for this CPS survey is more than 90% (according to Appendix G of the documentation).  That’s a very high response rate, which the government can achieve because it has a lot of money and resources to do these surveys.  Very simply, not only do the weights give us total numbers that very conveniently add up to the whole US population, but the weighted numbers provide what researchers believe are better (less biased) estimates of the relative sizes of groups.  In other words, 12.96% (from the weighted tabulation) is a better estimate for the Black proportion of the US than 10.19% (from the unweighted tabulation).

 

 

4) Summarize.  For numerical variables where the numbers really mean something, such as earnings, the summarize command provides mean (i.e. numerical average), standard deviation, minimum and maximum values.

 

. summarize incwage if year==2000

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |    103226    19462.59    28843.38          0     364302

 

What does this mean?  Well, it means the average earned income for 1999 (the full year before the survey) was $19,463.  This seems kind of low, doesn’t it?  One of the most important things you have to do is interpret the output, and ask yourself whether the output makes sense.  The number of observations here is 103,226, which is less than 133,710, the full number of individuals in the March 2000 CPS, because summarize skips anyone whose incwage is missing.  But the 103,226 still include lots of people who are too young or too old to work, not to speak of the adults who are unemployed, and so on!  Perhaps the mean earnings appear to be low because we’ve included in our sample population some people whose earnings must be zero.  We will use the if qualifier to tell STATA that we only want to examine a particular sub-population (see below for further explanation). Also, when we get around to comparing wages and income over time, it is important to take inflation into account (otherwise the numbers won’t really make sense).
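
A quick way to see how many people have no incwage value at all, and how many zeros are pulling the mean down.  This is only a sketch, and the output is not shown here:

. count if year==2000 & missing(incwage)

. count if year==2000 & incwage==0

The first count should come to 133,710 - 103,226 = 30,484, since summarize only counts observations with a non-missing value.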

 

 

5) Sorting.  You can sort by variables (sex, race, veteran status etc) and then calculate earnings (the Sort and then By syntax works with most STATA commands)

 

. sort  sex

 

 

. by sex: summarize incwage if year==2000

 

-> sex = Male

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |     49353     25943.8    34862.55          0     364302

 

-> sex = Female

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |     53873    13525.17    20172.65          0     333564

 

 

The average earnings of men in the sample are $25,944 per year compared to $13,525 for women, but remember we’re still including all sorts of people here whose earnings must be zero.
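
If you don't want to sort as a separate step, bysort does the sort and the by in one line.  This should give the same output as the two commands above:

. bysort sex: summarize incwage if year==2000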

 

 

 

6) Let's say you want to look at earnings, but only for people in their 20s.  You would use the IF construction, as follows:

 

 

 

. by sex: summarize incwage if year==2000 & age>19 & age<30

 

-> sex = Male

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |      8351    19628.57     19126.9          0     257525

 

-> sex = Female

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |      8808     13355.3    16022.36          0     333564

 

 

And you could further limit it by looking only at people who had positive earnings, so

 

 

 

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0

 

-> sex = Male

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |      7348    22307.87    18868.08          8     257525

 

-> sex = Female

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |      6921     16996.6    16273.32          5     333564

 

 

But notice here that the low end of the income spectrum for both men and women still includes people with yearly(!) wages of less than $10 for 1999. Those folks will skew the comparison quite a bit.

 

If you use the weight variable, you'll get a more accurate average, and you'll know how many people in the US actually fit your criteria.  So:

 

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 [fweight= perwt_rounded]

 

-------------------------------------------------------------------------------------------------------------------

-> sex = Male

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |  15802234     22712.7    19420.26          8     257525

 

-------------------------------------------------------------------------------------------------------------------

-> sex = Female

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |  14773668    17329.45    16514.31          5     333564

 

 

What this means is that there were 15.8 million twentysomething men and 14.8 million twentysomething women with positive wage and salary income in the US in 1999, and their average wage income was about $22.7K for the men and $17.3K for the women.  Now positive earnings may still not be the right group to think about.  A lot of people in their 20s aren’t working full time, but may still have a small income.  So is there a better group to think about?  Well, the CPS asks a variety of questions about labor force participation.  One of these variables is weeks worked last year, or wkswork1. Let’s limit our analysis to people who worked at least 30 weeks in 1999.

 

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 &  wkswork1>29 [fweight= perwt_rounded]

 

-------------------------------------------------------------------------------------------------------------------

-> sex = Male

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |  13486090    25412.34    19336.23          8     257525

 

-------------------------------------------------------------------------------------------------------------------

-> sex = Female

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

     incwage |  11744135    20436.34    16906.75          5     333564
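
As a rough summary of this last output, you can have STATA do the arithmetic on the two means with display:

. display 20436.34/25412.34

This comes to about 0.80, so among twentysomethings who worked at least 30 weeks, women's average wage income was roughly 80% of men's.  You could tighten the comparison further with usual hours worked per week, uhrswork.  The 35-hour cutoff for ‘full time’ below is an assumption made for illustration, not something defined in this guide, and you should check the ipums documentation for special codes in uhrswork before relying on it:

. by sex: summarize incwage if year==2000 & age>19 & age<30 & incwage>0 & wkswork1>29 & uhrswork>=35 [fweight= perwt_rounded]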

 

 

 

 

 

Note that in the above command, I said “year==2000” with a double equal sign.  Why?  Well, STATA has some special syntax for stuff that goes after the IF.

 

What is the syntax to use after the IF?

<          less than

>          greater than

<=        less than or equal to

>=        greater than or equal to

==        Equals (that's right, in comparisons STATA uses the double equal sign to mean 'equal')

~=        Not Equal

&         And.  Puts two or more conditions together

|           Or. 
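
A sketch of combining And with Or.  Here the goal is people in the 2000 survey who are either under 20 or over 64:

. summarize incwage if year==2000 & (age<20 | age>64)

The parentheses matter.  STATA evaluates & before |, so without them the condition would mean ‘(year 2000 and under 20) or over 64’, and the over-64s from every survey year would be included.  One more thing to watch: STATA treats a missing numeric value as larger than any number, so a condition like age>64 is also true for observations where age is missing.  Add & !missing(age) when that matters.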

 

7)  In order to generate a new variable, you would use generate.  So

 

 

. generate perwt_rounded=round(perwt)

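This operation creates a new variable, perwt_rounded, by rounding an existing variable, perwt, to the nearest integer. Frequency weights must be integers, which is why the weighted commands above use perwt_rounded rather than perwt.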

 

. replace  perwt_rounded=0 if  perwt<0

(51 real changes made)

 

This operation replaced perwt_rounded with zero wherever perwt was negative. Negative population weights don’t make sense in surveys.

 

 

Note that in generate and replace, where values are being assigned to variables, you use a SINGLE Equals sign, rather than a double.
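
Generate can also build a variable from a condition.  A sketch; the variable name twenties is made up for this example:

. generate byte twenties = (age>19 & age<30)

. tabulate twenties sex if year==2000

The expression in parentheses is 1 where the condition is true and 0 where it is false, so twenties is a 0/1 indicator for being in your 20s.  The single equals sign does the assigning; a test for equality inside the parentheses would still need the double equals sign.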

 

There are several CPS educational variables. The one that is most consistent across survey years is educrec. None of the educational variables corresponds to actual years of education; they are all categorical variables, so:

 

. tabulate educrec year

 

          Educational |                 Survey year

    attainment recode |      1962       1970       1980       1990 |     Total

----------------------+--------------------------------------------+----------

                  NIU |         0     40,544     43,935     36,714 |   201,448

    None or preschool |     1,167      1,030      1,114        755 |     5,073

 Grades 1, 2, 3, or 4 |     3,360      3,114      2,847      1,895 |    13,689

 Grades 5, 6, 7, or 8 |    19,806     22,742     18,145     11,385 |    86,482

              Grade 9 |     5,651      7,678      8,490      5,883 |    37,845

             Grade 10 |     5,984      8,742      9,807      6,787 |    42,882

             Grade 11 |     4,315      6,648      8,060      6,234 |    37,213

             Grade 12 |    19,743     33,438     48,926     44,509 |   227,306

1 to 3 years of colle |     6,475     11,621     21,259     21,967 |   128,211

  4+ years of college |     5,240      9,465     18,905     21,950 |   116,295

      Missing/Unknown |         0          1          0          0 |         1

----------------------+--------------------------------------------+----------

                Total |    71,741    145,023    181,488    158,079 |   896,445

 

 

          Educational |      Survey year

    attainment recode |      2000       2008 |     Total

----------------------+----------------------+----------

                  NIU |    30,484     49,771 |   201,448

    None or preschool |       457        550 |     5,073

 Grades 1, 2, 3, or 4 |     1,187      1,286 |    13,689

 Grades 5, 6, 7, or 8 |     6,847      7,557 |    86,482

              Grade 9 |     4,161      5,982 |    37,845

             Grade 10 |     4,695      6,867 |    42,882

             Grade 11 |     4,721      7,235 |    37,213

             Grade 12 |    33,461     47,229 |   227,306

1 to 3 years of colle |    25,883     41,006 |   128,211

  4+ years of college |    21,814     38,921 |   116,295

      Missing/Unknown |         0          0 |         1

----------------------+----------------------+----------

                Total |   133,710    206,404 |   896,445

 

Note that educrec, the older educational variable that applies to all years, topcodes education at 4+ years of college. That’s not very useful if we want to look at people with advanced degrees.

 

. tabulate educrec, nolab

 

Educational |

 attainment |

     recode |      Freq.     Percent        Cum.

------------+-----------------------------------

          0 |    201,448       22.47       22.47

          1 |      5,073        0.57       23.04

          2 |     13,689        1.53       24.56

          3 |     86,482        9.65       34.21

          4 |     37,845        4.22       38.43

          5 |     42,882        4.78       43.22

          6 |     37,213        4.15       47.37

          7 |    227,306       25.36       72.72

          8 |    128,211       14.30       87.03

          9 |    116,295       12.97      100.00

         99 |          1        0.00      100.00

------------+-----------------------------------

      Total |    896,445      100.00

 

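When you tabulate without the labels, you see how the different categories are really stored. We can use those stored numbers to build an approximate years-of-education variable, yrsed, by assigning each category a representative number of years (for instance, 2.5 years for grades 1 to 4):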

 

. gen yrsed=. if educrec==0

(896445 missing values generated)

 

. replace yrsed=0 if educrec==1

(5073 real changes made)

 

. replace yrsed=2.5 if educrec==2

(13689 real changes made)

 

. replace yrsed=6.5 if educrec==3

(86482 real changes made)

 

. replace yrsed=9 if educrec==4

(37845 real changes made)

 

. replace yrsed=10 if educrec==5

(42882 real changes made)

 

. replace yrsed=11 if educrec==6

(37213 real changes made)

 

. replace yrsed=12 if educrec==7

(227306 real changes made)

 

. replace yrsed=14 if educrec==8

(128211 real changes made)

 

. replace yrsed=17 if educrec==9

(116295 real changes made)

 

. replace yrsed=. if educrec==99

(0 real changes made)

 

. table educrec, contents(mean yrsed)

 

-------------------------------------

Educational attainment  |

recode                  | mean(yrsed)

------------------------+------------

                    NIU |           

      None or preschool |           0

   Grades 1, 2, 3, or 4 |         2.5

   Grades 5, 6, 7, or 8 |         6.5

                Grade 9 |           9

               Grade 10 |          10

               Grade 11 |          11

               Grade 12 |          12

1 to 3 years of college |          14

    4+ years of college |          17

        Missing/Unknown |           

-------------------------------------

 

table is like tabulate, except table allows you to put means or other calculated values in the cells, while tabulate just wants counts (or weighted counts) and the associated row and column percentages in each cell.
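
Two other commands give similar tables of means, and may be handy if table behaves differently on your machine. As far as I know, newer versions of STATA (17 and later) redesigned table, so that the option there is statistic(mean yrsed) rather than contents(mean yrsed); check the online help for table if the command above gives an error. The sketch below uses five invented observations, not the CPS data.

```stata
* Toy data, invented for illustration only
clear
input educrec yrsed
7 12
7 12
8 14
9 17
9 17
end

* Means, standard deviations, and counts of yrsed for each category
tabulate educrec, summarize(yrsed)

* A compact table of chosen statistics for each category
tabstat yrsed, by(educrec) statistics(mean n)
```

Both commands should show one row per education category, with the mean of yrsed in it.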

 

This then allows me to look at educational attainment by race:

 

. sort race

 

. by race: summarize  yrsed if year==2000

 

-> race = White

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |     88334    12.81067    3.154691          0         17

 

-> race = Black/Negro

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |      9916    12.32498    2.928274          0         17

 

-> race = American Indian/Aleut/Eskimo

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |      1320    11.92008    3.017615          0         17

 

-> race = Asian or Pacific Islander

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |      3656    13.39401    3.606886          0         17

 

Or even better, we can look at educational attainment by race for people old enough to have gone through the educational system (here, people older than 25):

 

 

. by race: summarize  yrsed if year==2000 & age>25

 

-> race = White

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |     71946    13.09026    3.179583          0         17

 

-> race = Black/Negro

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |      7768     12.5708    3.006368          0         17

 

-> race = American Indian/Aleut/Eskimo

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |       983    12.23398    3.108638          0         17

 

-> race = Asian or Pacific Islander

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

       yrsed |      2850    13.66737    3.770145          0         17
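
A shortcut worth knowing: bysort does the sort and the by in one step. The sketch below uses five invented observations (not CPS data) and also guards against missing ages, since STATA counts a missing value as larger than any number.

```stata
* Toy data, invented for illustration only
clear
input year race age yrsed
2000 100 40 12
2000 100 30 17
2000 200 50 14
2000 200 20 12
2008 100 35 14
end

* bysort sorts by race and then runs the command once per racial group
bysort race: summarize yrsed if year==2000 & age>25

* The same restriction, excluding any observation with a missing age,
* because a missing age would otherwise pass the age>25 test
bysort race: summarize yrsed if year==2000 & age>25 & !missing(age)
```

In this toy dataset nobody has a missing age, so the two commands give the same results; in real data they may not.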

 

 

8) A Note about Numbers, Data, Categories, and Labels.  If you tabulate race, you get this:

 

 

. tabulate race if year==2000

 

                                 Race |      Freq.     Percent        Cum.

--------------------------------------+-----------------------------------

                                White |    113,475       84.87       84.87

                          Black/Negro |     13,626       10.19       95.06

         American Indian/Aleut/Eskimo |      1,894        1.42       96.47

            Asian or Pacific Islander |      4,715        3.53      100.00

--------------------------------------+-----------------------------------

                                Total |    133,710      100.00

 

. tabulate race if year==2000, nolabel

 

       Race |      Freq.     Percent        Cum.

------------+-----------------------------------

        100 |    113,475       84.87       84.87

        200 |     13,626       10.19       95.06

        300 |      1,894        1.42       96.47

        650 |      4,715        3.53      100.00

------------+-----------------------------------

      Total |    133,710      100.00

 

In other words, race is stored as a numerical variable, with 4 categories in the CPS 2000 (more categories in 2008).  The racial groups don’t correspond to numbers in any meaningful way. The numbers are arbitrary. The categories are nominal, as opposed to ordinal (ordinal categories have a meaningful order). You could, if you wanted to, take the average of those numerical categories, but the results would be meaningless.

 

. summarize race if year==2000

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

        race |    133710    132.4183    105.8387        100        650

 

One key mistake *NOT* to make is to treat nominal variables like race as if the numbers really meant something. Do numerical manipulation only on variables that have numerical values that really mean something, for instance the income variables. On the other side of things, tabulate is good for categorical variables like race, but mostly useless for continuous variables like income, because income takes thousands of distinct values and tabulate would print a separate row for each one.
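
For a continuous variable like income, summarize with the detail option is usually more informative than tabulate, and if you do want a table you can first group the values into brackets. The sketch below uses four invented wages and bracket cutoffs that I picked arbitrarily for the example; incgroup is a made-up variable name.

```stata
* Toy data, invented for illustration only
clear
input incwage
5000
25000
45000
333564
end

* Percentiles, mean, and spread: sensible for a continuous variable
summarize incwage, detail

* Group wages into brackets; each observation gets the lower bound of
* its bracket (values at or above the last cutoff would be set to missing)
egen incgroup = cut(incwage), at(0,20000,40000,60000,1000000)

* Now tabulate has only a handful of rows to show
tabulate incgroup
```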

 

If you want to examine only blacks, you need to use the condition “if race==200”. Stata doesn’t understand a condition that compares race to the text of a label, because race is stored as a number, not as text. For example:

 

. summarize inctot if race=="White"

type mismatch

r(109);
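
If you don't remember which number goes with which category, you can ask STATA to list the labels, or let it look the number up for you. In the sketch below I build a tiny made-up dataset and attach a value label that I have named racelbl; the name of the label attached to race in the class dataset may be different (describe race will show it).

```stata
* Toy data, invented for illustration only
clear
input race inctot
100 30000
100 50000
200 28000
650 41000
end

* Attach labels to the numeric codes (racelbl is an assumed label name)
label define racelbl 100 "White" 200 "Black/Negro" 300 "American Indian/Aleut/Eskimo" 650 "Asian or Pacific Islander"
label values race racelbl

* Show which number goes with which label
label list racelbl

* describe shows the name of the value label attached to a variable
describe race

* Compare against the stored number...
summarize inctot if race==100

* ...or have STATA look the number up from the label text
summarize inctot if race=="White":racelbl
```

The last two summarize commands select the same observations. The label text has to match exactly, including capitalization.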

 

 

So, to compare the regional distribution of blacks with that of the whole sample:

 

. tabulate region if year==2000

 

        Region and division |      Freq.     Percent        Cum.

----------------------------+-----------------------------------

       New England Division |      9,470        7.08        7.08

   Middle Atlantic Division |     17,734       13.26       20.35

East North Central Division |     18,311       13.69       34.04

West North Central Division |     11,446        8.56       42.60

    South Atlantic Division |     21,015       15.72       58.32

East South Central Division |      6,564        4.91       63.23

West South Central Division |     13,299        9.95       73.17

          Mountain Division |     16,382       12.25       85.42

           Pacific Division |     19,489       14.58      100.00

----------------------------+-----------------------------------

                      Total |    133,710      100.00

 

. tabulate region if year==2000 & race==200

 

        Region and division |      Freq.     Percent        Cum.

----------------------------+-----------------------------------

       New England Division |        477        3.50        3.50

   Middle Atlantic Division |      2,308       16.94       20.44

East North Central Division |      2,239       16.43       36.87

West North Central Division |        420        3.08       39.95

    South Atlantic Division |      4,168       30.59       70.54

East South Central Division |      1,370       10.05       80.60

West South Central Division |      1,597       11.72       92.32

          Mountain Division |        284        2.08       94.40

           Pacific Division |        763        5.60      100.00

----------------------------+-----------------------------------

                      Total |     13,626      100.00

 

Blacks are overrepresented in the South Atlantic (30.59% of blacks live there, versus 15.72% of the whole sample), and underrepresented in the Pacific region of the US (5.60% versus 14.58%).
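
You can also get this comparison from a single two-way table by asking tabulate for column percentages, which puts the regional distribution of each racial group side by side with the total. Here is a minimal sketch on six invented observations (the region codes 1 and 2 are placeholders, not the real CPS codes).

```stata
* Toy data, invented for illustration only
clear
input region race
1 100
1 100
1 200
2 100
2 200
2 200
end

* column: show, within each race, the percentage living in each region
* nofreq: suppress the raw counts so only the percentages appear
tabulate region race, column nofreq
```

With the class data the command would be tabulate region race if year==2000, column nofreq.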