------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_
> meth_proj3\2010_logs\first_class.log
log type: text
opened on: 26 Jan 2010, 14:23:02
. set mem 200m
* The first thing you need to do is expand the memory available to Stata, so that the dataset fits.
Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.909M
    set memory         200M     max. data space                200.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -----------
                                                               203.163M
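* A side note, offered as a sketch and not run in this session: in Stata 12 and later, memory is
* managed automatically, and "set memory" is accepted but ignored. To see the memory settings
* currently in effect, one can type:
query memory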
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", clear
. describe
Contains data from C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dt
> a
obs: 133,710
vars: 55 1 Feb 2009 13:36
size: 15,109,230 (92.8% of memory free)
------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------
year int %8.0g yearlbl Survey year
serial long %12.0g seriallbl
Household serial number
hhwt float %9.0g hhwtlbl Household weight
region byte %27.0g regionlbl
Region and division
statefip byte %57.0g statefiplbl
State (FIPS code)
metro byte %27.0g metrolbl Metropolitan central city status
metarea int %50.0g metarealbl
Metropolitan area
ownershp byte %21.0g ownershplbl
Ownership of dwelling
hhincome long %12.0g hhincomelbl
Total household income
pubhous byte %8.0g pubhouslbl
Living in public housing
foodstmp byte %8.0g foodstmplbl
Food stamp recipiency
pernum byte %8.0g pernumlbl
Person number in sample unit
perwt float %9.0g perwtlbl Person weight
momloc byte %8.0g momloclbl
Mother's location in the household
poploc byte %8.0g poploclbl
Father's location in the household
sploc byte %8.0g sploclbl Spouse's location in household
famsize byte %25.0g famsizelbl
Number of own family members in hh
nchild byte %18.0g nchildlbl
Number of own children in household
nchlt5 byte %23.0g nchlt5lbl
Number of own children under age 5 in hh
nsibs byte %18.0g nsibslbl Number of own siblings in household
relate int %34.0g relatelbl
Relationship to household head
age byte %19.0g agelbl Age
sex byte %8.0g sexlbl Sex
race int %37.0g racelbl Race
marst byte %23.0g marstlbl Marital status
popstat byte %14.0g popstatlbl
Adult civilian, armed forces, or child
bpl long %27.0g bpllbl Birthplace
yrimmig int %11.0g yrimmiglbl
Year of immigration
citizen byte %31.0g citizenlbl
Citizenship status
mbpl long %27.0g mbpllbl Mother's birthplace
fbpl long %27.0g fbpllbl Father's birthplace
hispan int %29.0g hispanlbl
Hispanic origin
educ99 byte %38.0g educ99lbl
Educational attainment, 1990
educrec byte %23.0g educreclbl
Educational attainment recode
schlcoll byte %45.0g schlcolllbl
School or college attendance
empstat byte %30.0g empstatlbl
Employment status
occ1990 int %78.0g occ1990lbl
Occupation, 1990 basis
wkswork1 byte %8.0g wkswork1lbl
Weeks worked last year
hrswork byte %8.0g hrsworklbl
Hours worked last week
uhrswork byte %13.0g uhrsworklbl
Usual hours worked per week (last yr)
hourwage int %8.0g hourwagelbl
Hourly wage
union byte %33.0g unionlbl Union membership
inctot long %12.0g Total personal income
incwage long %12.0g Wage and salary income
incss long %12.0g Social Security income
incwelfr long %12.0g Welfare (public assistance) income
vetstat byte %10.0g vetstatlbl
Veteran status
vetlast byte %26.0g vetlastlbl
Veteran's most recent period of service
disabwrk byte %34.0g disabwrklbl
Work disability
health byte %9.0g healthlbl
Health status
inclugh byte %8.0g inclughlbl
Included in employer group health plan last
year
himcaid byte %8.0g himcaidlbl
Covered by Medicaid last year
ftotval double %10.0g ftotvallbl
Total family income
perwt_rounded float %9.0g integer perwt, negative values recoded to 0
yrsed float %9.0g based on educrec
------------------------------------------------------------------------------------------
Sorted by: race
. tabulate sex
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 64,791 48.46 48.46
Female | 68,919 51.54 100.00
------------+-----------------------------------
Total | 133,710 100.00
. tabulate race
Race | Freq. Percent Cum.
--------------------------------------+-----------------------------------
White | 113,475 84.87 84.87
Black/Negro | 13,626 10.19 95.06
American Indian/Aleut/Eskimo | 1,894 1.42 96.47
Asian or Pacific Islander | 4,715 3.53 100.00
--------------------------------------+-----------------------------------
Total | 133,710 100.00
. tabulate race [fweight= perwt_rounded]
Race | Freq. Percent Cum.
--------------------------------------+-----------------------------------
White |224,806,952 82.02 82.02
Black/Negro | 35,508,668 12.96 94.98
American Indian/Aleut/Eskimo | 2,847,473 1.04 96.01
Asian or Pacific Islander | 10,924,728 3.99 100.00
--------------------------------------+-----------------------------------
Total |274,087,821 100.00
. *There is a difference between the weighted and unweighted percentages. For instance, bl
> acks make up 10.19% of the unweighted but 12.96% of the weighted data.
.
. summarize perwt_rounded
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
perwt_roun~d | 133710 2049.868 1083.244 93 14281
. *It is key to understand that tabulate is for categorical variables like race, and summarize is for continuous variables like weight and income, where every respondent might have a different value.
. *If you mix these up, you will get nonsensical results, for example:
. tabulate perwt_rounded
integer |
perwt, |
negative |
values |
recoded to |
0 | Freq. Percent Cum.
------------+-----------------------------------
93 | 3 0.00 0.00
96 | 1 0.00 0.00
98 | 1 0.00 0.00
99 | 3 0.00 0.01
103 | 2 0.00 0.01
104 | 1 0.00 0.01
105 | 3 0.00 0.01
109 | 1 0.00 0.01
112 | 1 0.00 0.01
115 | 1 0.00 0.01
116 | 2 0.00 0.01
117 | 2 0.00 0.02
118 | 4 0.00 0.02
120 | 3 0.00 0.02
121 | 7 0.01 0.03
122 | 1 0.00 0.03
123 | 4 0.00 0.03
124 | 1 0.00 0.03
126 | 5 0.00 0.03
128 | 4 0.00 0.04
129 | 2 0.00 0.04
131 | 3 0.00 0.04
132 | 5 0.00 0.04
133 | 2 0.00 0.05
134 | 6 0.00 0.05
135 | 1 0.00 0.05
136 | 4 0.00 0.05
--Break--
r(1);
. *we didn't want that table anyway
. *another wrong thing to do is to summarize a categorical variable
. summarize race
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
race | 133710 132.4183 105.8387 100 650
. *This makes no sense (how can you take the average of race?), but Stata will let you do it, so watch out.
. tabulate race
Race | Freq. Percent Cum.
--------------------------------------+-----------------------------------
White | 113,475 84.87 84.87
Black/Negro | 13,626 10.19 95.06
American Indian/Aleut/Eskimo | 1,894 1.42 96.47
Asian or Pacific Islander | 4,715 3.53 100.00
--------------------------------------+-----------------------------------
Total | 133,710 100.00
. tabulate race, nolabel
Race | Freq. Percent Cum.
------------+-----------------------------------
100 | 113,475 84.87 84.87
200 | 13,626 10.19 95.06
300 | 1,894 1.42 96.47
650 | 4,715 3.53 100.00
------------+-----------------------------------
Total | 133,710 100.00
* Without the labels, we can see that race is stored as a number. In fact, all the variables in this dataset are stored as numbers; “White” and “Black/Negro” are just labels attached to the numbers.
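* A possible follow-up, sketched here and not run in this session: to see which number goes with
* which label, list the value label attached to race (describe shows it is named racelbl), or ask
* for a codebook of the variable.
label list racelbl
codebook race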
. sort sex
. by sex: summarize incwelfr
------------------------------------------------------------------------------------------
-> sex = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwelfr | 49353 11.35025 245.3368 0 13800
------------------------------------------------------------------------------------------
-> sex = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwelfr | 53873 67.43862 618.6006 0 25000
. *What if we want only the welfare income for people who actually had welfare income?
. by sex: summarize incwelfr if incwelfr>0, detail
------------------------------------------------------------------------------------------
-> sex = Male
Welfare (public assistance) income
-------------------------------------------------------------
Percentiles Smallest
1% 4 1
5% 113 4
10% 240 12 Obs 188
25% 829.5 12 Sum of Wgt. 188
50% 2481 Mean 2979.622
Largest Std. Dev. 2644.509
75% 4337.5 8892
90% 6600 11580 Variance 6993429
95% 8400 13200 Skewness 1.260811
99% 13200 13800 Kurtosis 4.869903
------------------------------------------------------------------------------------------
-> sex = Female
Welfare (public assistance) income
-------------------------------------------------------------
Percentiles Smallest
1% 48 1
5% 280 1
10% 480 1 Obs 1101
25% 1074 12 Sum of Wgt. 1101
50% 2766 Mean 3299.837
Largest Std. Dev. 2839.866
75% 4692 15600
90% 7152 19999 Variance 8064841
95% 8400 23292 Skewness 1.863679
99% 12084 25000 Kurtosis 9.951343
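* A more compact alternative, sketched here and not run in this session: tabstat can put the same
* kind of statistics for men and women side by side in one table. The particular statistics listed
* are only an illustration. The !missing() condition is redundant for tabstat, which skips missing
* values anyway, but it makes the intent explicit.
tabstat incwelfr if incwelfr>0 & !missing(incwelfr), by(sex) statistics(n mean median sd)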
. by sex: summarize incwelfr if incwelfr>0 [fweight= perwt_rounded]
------------------------------------------------------------------------------------------
-> sex = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwelfr | 357702 2897.24 2577.316 1 13800
------------------------------------------------------------------------------------------
-> sex = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwelfr | 1101 3100.608 2837.588 1 25000
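. *The Obs reported for women here (1101) is the unweighted N, which looks like a display bug. The weights were applied, since the weighted mean (3100.608) differs from the unweighted mean (3299.837). The weighted N for women is 2,193,544, as the final tabulate of sex by welfare income below shows.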
. tabulate sex [fweight= perwt_rounded] if incwelfr>0
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 31,176,879 49.59 49.59
Female | 31,688,337 50.41 100.00
------------+-----------------------------------
Total | 62,865,216 100.00
. tabulate sex if incwelfr>0 [fweight= perwt_rounded]
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 31,176,879 49.59 49.59
Female | 31,688,337 50.41 100.00
------------+-----------------------------------
Total | 62,865,216 100.00
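. *These totals are far too large. It is not a bug: Stata treats a missing value (.) as larger than any number, so the condition incwelfr>0 is also true for everyone whose incwelfr is missing.
*after class I tried this, below, to exclude the missing values (represented by the period)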
. tabulate sex if incwelfr>0 & incwelfr!=. [fweight= perwt_rounded]
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 357,702 14.02 14.02
Female | 2,193,544 85.98 100.00
------------+-----------------------------------
Total | 2,551,246 100.00
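* A sketch for checking this kind of problem, not run in this session. Stata treats a missing
* value (.) as larger than any number, so a condition like incwelfr>0 is also true when incwelfr
* is missing. The missing() function makes the exclusion explicit, and it also catches extended
* missing values (.a, .b, ...), which the test incwelfr!=. would let through.
count if missing(incwelfr)
count if incwelfr>0 & !missing(incwelfr)
tabulate sex if incwelfr>0 & !missing(incwelfr) [fweight=perwt_rounded]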
. by sex: summarize incwage
------------------------------------------------------------------------------------------
-> sex = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 49353 25943.8 34862.55 0 364302
------------------------------------------------------------------------------------------
-> sex = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 53873 13525.17 20172.65 0 333564
. *income is topcoded
. *another example of topcoding is age
. tabulate age
Age | Freq. Percent Cum.
--------------------+-----------------------------------
Under 1 year | 1,713 1.28 1.28
1 | 1,932 1.44 2.73
2 | 1,950 1.46 4.18
3 | 1,939 1.45 5.63
4 | 1,965 1.47 7.10
5 | 1,998 1.49 8.60
6 | 2,059 1.54 10.14
7 | 2,176 1.63 11.77
8 | 2,163 1.62 13.38
9 | 2,243 1.68 15.06
10 | 2,202 1.65 16.71
11 | 2,083 1.56 18.27
12 | 2,035 1.52 19.79
13 | 2,047 1.53 21.32
14 | 1,979 1.48 22.80
15 | 2,046 1.53 24.33
16 | 1,965 1.47 25.80
17 | 1,998 1.49 27.29
18 | 1,847 1.38 28.67
19 | 1,826 1.37 30.04
20 | 1,722 1.29 31.33
21 | 1,687 1.26 32.59
22 | 1,638 1.23 33.81
23 | 1,622 1.21 35.03
24 | 1,662 1.24 36.27
25 | 1,666 1.25 37.52
26 | 1,640 1.23 38.74
27 | 1,726 1.29 40.03
28 | 1,801 1.35 41.38
29 | 1,995 1.49 42.87
30 | 1,907 1.43 44.30
31 | 1,991 1.49 45.79
32 | 1,890 1.41 47.20
33 | 1,898 1.42 48.62
34 | 2,024 1.51 50.13
35 | 2,134 1.60 51.73
36 | 2,123 1.59 53.32
37 | 2,099 1.57 54.89
38 | 2,064 1.54 56.43
39 | 2,228 1.67 58.10
40 | 2,190 1.64 59.74
41 | 2,115 1.58 61.32
42 | 2,137 1.60 62.92
43 | 2,091 1.56 64.48
44 | 2,114 1.58 66.06
45 | 2,118 1.58 67.64
46 | 1,939 1.45 69.10
47 | 1,957 1.46 70.56
48 | 1,827 1.37 71.93
49 | 1,767 1.32 73.25
50 | 1,865 1.39 74.64
51 | 1,802 1.35 75.99
52 | 1,825 1.36 77.35
53 | 1,695 1.27 78.62
54 | 1,301 0.97 79.59
55 | 1,323 0.99 80.58
56 | 1,324 0.99 81.57
57 | 1,304 0.98 82.55
58 | 1,128 0.84 83.39
59 | 1,129 0.84 84.24
60 | 1,154 0.86 85.10
61 | 1,051 0.79 85.89
62 | 1,073 0.80 86.69
63 | 938 0.70 87.39
64 | 952 0.71 88.10
65 | 1,014 0.76 88.86
66 | 869 0.65 89.51
67 | 926 0.69 90.20
68 | 908 0.68 90.88
69 | 904 0.68 91.56
70 | 913 0.68 92.24
71 | 885 0.66 92.90
72 | 770 0.58 93.48
73 | 797 0.60 94.08
74 | 814 0.61 94.68
75 | 796 0.60 95.28
76 | 704 0.53 95.81
77 | 646 0.48 96.29
78 | 687 0.51 96.80
79 | 602 0.45 97.25
80 | 514 0.38 97.64
81 | 476 0.36 97.99
82 | 425 0.32 98.31
83 | 427 0.32 98.63
84 | 325 0.24 98.87
85 | 306 0.23 99.10
86 | 248 0.19 99.29
87 | 209 0.16 99.44
88 | 172 0.13 99.57
89 | 155 0.12 99.69
90 (90+, 1988-2002) | 416 0.31 100.00
--------------------+-----------------------------------
Total | 133,710 100.00
. *topcoding (here age topcoded to 90) is for purposes of confidentiality
. by sex: summarize incwage if incwage>0
------------------------------------------------------------------------------------------
-> sex = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 34897 36690.96 36394.4 1 364302
------------------------------------------------------------------------------------------
-> sex = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 32504 22416.97 21797.73 1 333564
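. *Stata has a built-in calculator
. display 36690.96-22426.97
14263.99
* Note: the command above mistypes the female mean, which is 22416.97, not 22426.97. The correct difference is 36690.96-22416.97 = 14273.99.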
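* Retyping means by hand invites typos. A sketch that avoids it, not run in this session:
* summarize leaves its results behind in r(), so the means can be saved and subtracted directly.
* The scalar name male_mean is made up for this illustration; sex is coded 1 for men and 2 for
* women (see tabulate sex, nolabel).
quietly summarize incwage if incwage>0 & sex==1
scalar male_mean = r(mean)
quietly summarize incwage if incwage>0 & sex==2
display male_mean - r(mean)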
. by sex: summarize incwage if incwage>0 & age>19 & age<40
------------------------------------------------------------------------------------------
-> sex = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 16234 31833.01 29178.08 5 362302
------------------------------------------------------------------------------------------
-> sex = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incwage | 14777 21108.5 19573.02 1 333564
. display 31833.01-21108.5
10724.51
. ttest incwage if incwage>0 & age>19 & age<40
by() option required
r(100);
. ttest incwage if incwage>0 & age>19 & age<40, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
Male | 16234 31833.01 229.0045 29178.08 31384.13 32281.88
Female | 14777 21108.5 161.0144 19573.02 20792.89 21424.1
---------+--------------------------------------------------------------------
combined | 31011 26722.69 145.5435 25630.13 26437.42 27007.96
---------+--------------------------------------------------------------------
diff | 10724.51 284.9786 10165.94 11283.08
------------------------------------------------------------------------------
diff = mean(Male) - mean(Female) t = 37.6327
Ho: diff = 0 degrees of freedom = 31009
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
. *This is one way of answering the question of whether the difference in earnings between men and women is real, or could be explained by random chance. The t-test says the difference of about 10.7K is very statistically significant (t = 37.63), far larger than random chance would be expected to produce.
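* The test above assumes that men and women have equal variances, but the two standard deviations
* differ quite a bit (29178.08 vs. 19573.02). A sketch of the same test without that assumption,
* not run in this session, uses the unequal option:
ttest incwage if incwage>0 & age>19 & age<40, by(sex) unequal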
. tabulate sex
Sex | Freq. Percent Cum.
------------+-----------------------------------
Male | 64,791 48.46 48.46
Female | 68,919 51.54 100.00
------------+-----------------------------------
Total | 133,710 100.00
. tabulate sex, nolabel
Sex | Freq. Percent Cum.
------------+-----------------------------------
1 | 64,791 48.46 48.46
2 | 68,919 51.54 100.00
------------+-----------------------------------
Total | 133,710 100.00
. gen male=0
*gen is short for generate, which creates a new variable
. replace male=1 if sex==1
(64791 real changes made)
. tabulate sex male
| male
Sex | 0 1 | Total
-----------+----------------------+----------
Male | 0 64,791 | 64,791
Female | 68,919 0 | 68,919
-----------+----------------------+----------
Total | 68,919 64,791 | 133,710
. label define male_label 0 "female" 1 "male"
. label values male male_label
. tabulate sex male
| male
Sex | female male | Total
-----------+----------------------+----------
Male | 0 64,791 | 64,791
Female | 68,919 0 | 68,919
-----------+----------------------+----------
Total | 68,919 64,791 | 133,710
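* A shorter way to build the same kind of dummy, sketched here and not run in this session. A
* logical expression in Stata evaluates to 1 when true and 0 when false, so one generate can do
* the work of gen plus replace. The variable name male2 is made up for this illustration. The if
* condition keeps any observation with missing sex from being coded 0 (sex has no missing values
* here, since tabulate sex accounts for all 133,710 observations).
generate male2 = (sex==1) if !missing(sex)
tabulate male male2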
. regress incwage male if incwage>0 & age>19 & age<40
Source | SS df MS Number of obs = 31011
-------------+------------------------------ F( 1, 31009) = 1416.22
Model | 8.8972e+11 1 8.8972e+11 Prob > F = 0.0000
Residual | 1.9481e+13 31009 628232693 R-squared = 0.0437
-------------+------------------------------ Adj R-squared = 0.0436
Total | 2.0371e+13 31010 656903664 Root MSE = 25065
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 10724.51 284.9786 37.63 0.000 10165.94 11283.08
_cons | 21108.5 206.1898 102.37 0.000 20704.36 21512.64
------------------------------------------------------------------------------
. *Simple linear regression is just another way of asking whether the 10K difference in income between men and women is greater than we could expect by chance. The answer is yes: the coefficient on male (10724.51) and its t-statistic (37.63) match the t-test above.
. regress incwage male if incwage>0 & age>19 & age<40 [iweight= perwt_rounded]
Source | SS df MS Number of obs = 65026619
-------------+------------------------------ F( 1, 65026617) = .
Model | 1.9498e+15 1 1.9498e+15 Prob > F = 0.0000
Residual | 4.2925e+16 65026617 660120490 R-squared = 0.0434
-------------+------------------------------ Adj R-squared = 0.0434
Total | 4.4875e+16 65026618 690105438 Root MSE = 25693
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 10966.31 6.380798 1718.64 0.000 10953.81 10978.82
_cons | 21614.93 4.626849 4671.63 0.000 21605.86 21624
------------------------------------------------------------------------------
. regress incwage male if incwage>0 & age>19 & age<40 [aweight= perwt_rounded]
(sum of wgt is 6.5027e+07)
Source | SS df MS Number of obs = 31011
-------------+------------------------------ F( 1, 31009) = 1408.54
Model | 9.2986e+11 1 9.2986e+11 Prob > F = 0.0000
Residual | 2.0471e+13 31009 660163046 R-squared = 0.0434
-------------+------------------------------ Adj R-squared = 0.0434
Total | 2.1401e+13 31010 690127682 Root MSE = 25694
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 10966.31 292.1976 37.53 0.000 10393.59 11539.03
_cons | 21614.93 211.8786 102.02 0.000 21199.64 22030.22
------------------------------------------------------------------------------
. *Note: The number of observations in the dataset that have positive incwage and meet the age criteria (age 20-39) is 31,011. The unweighted regression at top gives us a difference in income between men and women of 10,724, which we have seen before, and a t-statistic of 37.63.
. *If we use the weights as iweights (as we do in the second regression) or as fweights, the number of observations is about 2,000 times greater, the difference in income between men and women is only slightly changed, but the t-statistic is much bigger (1718) because we told Stata that there are really 65 million instead of 31 thousand people in this little experiment.
. *The third and bottom panel uses aweights, which rescales the weights to average 1 but still uses them. The result is that the coefficient reflects the weights, but the number of observations is still 31 thousand, which is the right number when doing statistical tests.
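* One more option, sketched here and not run in this session: for survey sampling weights Stata
* also offers pweights. With regress, pweights give the same coefficients as aweights and keep the
* number of observations at the unweighted count, but the standard errors are computed with the
* robust (sandwich) formula. Using perwt_rounded instead of perwt is an assumption made here,
* because pweights may not be negative and the label on perwt_rounded says negative values were
* recoded to 0.
regress incwage male if incwage>0 & age>19 & age<40 [pweight=perwt_rounded]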
. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace
file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved
* I saved because I created a new variable, the male variable. Like use and log, save is best done through the menus.
. exit, clear
* exit is another menu function