-----------------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2010_s381_logs\class8.log
log type: text
opened on: 14 Oct 2010, 12:20:01
* Here are some housekeeping things I did before class started:
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", clear
. gen occ1990_reduced= occ1990 if occ1990==95|occ1990==125|occ1990==178
(132297 missing values generated)
. describe occ1990
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------------------------------
occ1990         int    %78.0g      occ1990lbl
                                              Occupation, 1990 basis
. tabulate occ1990 if occ1990==95|occ1990==125|occ1990==178
Occupation, 1990 basis | Freq. Percent Cum.
----------------------------------------+-----------------------------------
Registered nurses | 966 68.37 68.37
Sociology instructors | 6 0.42 68.79
Lawyers | 441 31.21 100.00
----------------------------------------+-----------------------------------
Total | 1,413 100.00
. tabulate occ1990 if occ1990==95|occ1990==125|occ1990==178, nolab
Occupation, |
1990 basis | Freq. Percent Cum.
------------+-----------------------------------
95 | 966 68.37 68.37
125 | 6 0.42 68.79
178 | 441 31.21 100.00
------------+-----------------------------------
Total | 1,413 100.00
. label define occ1990_reduced 95 "nurses" 125 "sociologists" 178 "lawyers"
. label val occ1990_reduced occ1990_reduced
. tabulate occ1990_reduced
occ1990_redu |
ced | Freq. Percent Cum.
-------------+-----------------------------------
nurses | 966 68.37 68.37
sociologists | 6 0.42 68.79
lawyers | 441 31.21 100.00
-------------+-----------------------------------
Total | 1,413 100.00
. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace
file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved
. ttest incwage if occ1990==95| occ1990==125, by(occ1990)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83
Sociolog |       6    41508.33    2842.722    6963.219    34200.88    48815.78
---------+--------------------------------------------------------------------
combined |     972    37561.37    698.6046    21780.33    36190.42    38932.32
---------+--------------------------------------------------------------------
    diff |           -3971.481    8923.041               -21482.17    13539.21
------------------------------------------------------------------------------
    diff = mean(Register) - mean(Sociolog)                        t =  -0.4451
Ho: diff = 0                                     degrees of freedom =      970
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3282         Pr(|T| > |t|) = 0.6564          Pr(T > t) = 0.6718
. ttest incwage if occ1990==95| occ1990==125, by(occ1990) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83
Sociolog |       6    41508.33    2842.722    6963.219    34200.88    48815.78
---------+--------------------------------------------------------------------
combined |     972    37561.37    698.6046    21780.33    36190.42    38932.32
---------+--------------------------------------------------------------------
    diff |           -3971.481    2928.283               -11252.58     3309.62
------------------------------------------------------------------------------
    diff = mean(Register) - mean(Sociolog)                        t =  -1.3562
Ho: diff = 0                     Satterthwaite's degrees of freedom =  5.62958
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1134         Pr(|T| > |t|) = 0.2269          Pr(T > t) = 0.8866
* One of the things I talked about a lot in class today was why the df for the unequal-variance t-test can be so different from the df for the equal-variance t-test. In simple terms, the equal-variance t-test pools the variance across both groups, so every observation counts toward the df (966 + 6 - 2 = 970). The unequal-variance t-test instead builds the standard error of the difference from each group's own standard error of the mean, and here the smaller sample dominates: the sociologists' standard error of the mean (2842.7) is nearly as large as the standard error of the difference (2928.3). You can therefore think of the unequal-variance t-test as drawing its variance information almost entirely from the smaller sample, which is why Satterthwaite's df is about 5.6 (close to the sociologists' n - 1 = 5) rather than 970.
* Note also that the two tests have the same substantive interpretation (no significant difference), so the wild difference between the two df values does not change the answer here. See my Excel file and also Stata's documentation on t-tests (either printed doc or online pdfs) for Satterthwaite's formula.
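* As a rough check on the two t-tests above (a sketch that was not part of the class session; the scalar names are made up for illustration), the standard errors of the difference and Satterthwaite's df can be recomputed by hand from the group statistics in the output:

```stata
* Standard errors of the mean and group sizes, copied from the t-test output above
scalar se_nurse = 702.6892
scalar se_soc   = 2842.722
scalar n_nurse  = 966
scalar n_soc    = 6
* Unequal-variance SE of the difference: square root of the sum of squared SEs
display sqrt(se_nurse^2 + se_soc^2)
* Satterthwaite's approximate degrees of freedom
display (se_nurse^2 + se_soc^2)^2 / (se_nurse^4/(n_nurse - 1) + se_soc^4/(n_soc - 1))
* Equal-variance SE of the difference, built from the pooled variance
scalar sd_nurse = 21839.96
scalar sd_soc   = 6963.219
scalar sp2 = ((n_nurse - 1)*sd_nurse^2 + (n_soc - 1)*sd_soc^2) / (n_nurse + n_soc - 2)
display sqrt(sp2 * (1/n_nurse + 1/n_soc))
```

* Up to rounding, the three results should agree with the 2928.283, 5.62958, and 8923.041 that Stata reports. Because se_soc^2 is so much larger than se_nurse^2, both the SE and the df of the unequal-variance test are driven almost entirely by the 6 sociologists.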
. regress incwage age if age>25 & age<65
      Source |       SS       df       MS              Number of obs =   67639
-------------+------------------------------           F(  1, 67637) =    4.47
       Model |  4.5722e+09     1  4.5722e+09           Prob > F      =  0.0345
    Residual |  6.9210e+13 67637  1.0233e+09           R-squared     =  0.0001
-------------+------------------------------           Adj R-squared =  0.0001
       Total |  6.9214e+13 67638  1.0233e+09           Root MSE      =   31988
------------------------------------------------------------------------------
     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -25.10229   11.87524    -2.11   0.035    -48.37775   -1.826828
       _cons |   27890.06   526.1078    53.01   0.000     26858.89    28921.23
------------------------------------------------------------------------------
* See my Excel file for a graphical example of why a straight line does not fit the relationship between age and income. The relationship is roughly a parabola, an upside-down "U", so we need a second-order age term to fit it.
. gen age_sq=age^2
. regress incwage age age_sq if age>25 & age<65
      Source |       SS       df       MS              Number of obs =   67639
-------------+------------------------------           F(  2, 67636) =  536.56
       Model |  1.0810e+12     2  5.4051e+11           Prob > F      =  0.0000
    Residual |  6.8133e+13 67636  1.0074e+09           R-squared     =  0.0156
-------------+------------------------------           Adj R-squared =  0.0156
       Total |  6.9214e+13 67638  1.0233e+09           Root MSE      =   31739
------------------------------------------------------------------------------
     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3276.086   101.6719    32.22   0.000      3076.81    3475.363
      age_sq |  -37.40804   1.144351   -32.69   0.000    -39.65096   -35.16511
       _cons |  -40886.74   2167.744   -18.86   0.000    -45135.51   -36637.96
------------------------------------------------------------------------------
* Note that in the first model age is barely significant at all, but here both age and age squared are highly significant, and the R-squared of the model has gone up quite a bit, from 0.0001 to 0.0156 (but still has room for improvement).
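* One way to read the quadratic model (a sketch that was not part of the class session; the scalar names are made up for illustration) is to find the age at which predicted income peaks. A parabola b_age*age + b_agesq*age^2 reaches its maximum where its slope is zero, that is, at age = -b_age / (2*b_agesq):

```stata
* Coefficients copied from the quadratic regression above
scalar b_age   = 3276.086
scalar b_agesq = -37.40804
* Age at which the fitted upside-down "U" reaches its top
display -b_age / (2*b_agesq)
```

* With these coefficients the peak falls at roughly age 44, inside the 26-64 age range used in the regression, which is why a straight line through the same data comes out nearly flat.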
. tabulate occ1990_reduced
occ1990_redu |
ced | Freq. Percent Cum.
-------------+-----------------------------------
nurses | 966 68.37 68.37
sociologists | 6 0.42 68.79
lawyers | 441 31.21 100.00
-------------+-----------------------------------
Total | 1,413 100.00
. table occ1990_reduced sex, contents(freq mean incwage) row col
----------------------------------------------------
occ1990_redu | Sex
ced | Male Female Total
-------------+--------------------------------------
nurses | 62 904 966
| 48602.45161 36777.9281 37536.85197
|
sociologists | 2 4 6
| 39200 42662.5 41508.33333
|
lawyers | 308 133 441
| 80236.42208 59704.73684 74044.32653
|
Total | 372 1,041 1,413
| 74743.46774 39729.70893 48947.76858
----------------------------------------------------
* Why we do multiple regression: we want to control for potential confounding variables. In this case, we might worry that the apparent advantage of lawyers over nurses is due to the fact that the lawyers are mostly male and the nurses mostly female. So let's put occupation and sex into the same regression.
. desmat: regress incwage occ1990_reduced sex
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 1413
F statistic: 83.696
Model degrees of freedom: 3
Residual degrees of freedom: 1409
R-squared: 0.151
Adjusted R-squared: 0.149
Root MSE 42234.929
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
occ1990_reduced
1 sociologists -604.948 17320.322
2 lawyers 25723.531** 3256.452
sex
3 Female -17003.194** 3422.971
4 _cons 53448.743** 3479.591
---------------------------------------------------------------------------------
* p < .05
** p < .01
* OK, even after accounting for the fact that women make less money than men, lawyers still earn significantly more than nurses (about $25,700 more, net of sex). So the lawyer-nurse gap is not just a function of the gender distribution in the two occupations. In fact, if you look at the table above, you see that male lawyers make a lot more than male nurses, and female lawyers make a lot more than female nurses.
. predict m1_oc_gen
(option xb assumed; fitted values)
(132297 missing values generated)
* The predict command generates the predicted values for the above model.
. table occ1990_reduced sex, contents(freq mean incwage mean m1_oc_gen) row col
----------------------------------------------------
occ1990_redu | Sex
ced | Male Female Total
-------------+--------------------------------------
nurses | 62 904 966
| 48602.45161 36777.9281 37536.85197
| 53448.74 36445.55 37536.85
|
sociologists | 2 4 6
| 39200 42662.5 41508.33333
| 52843.8 35840.6 41508.33
|
lawyers | 308 133 441
| 80236.42208 59704.73684 74044.32653
| 79172.27 62169.08 74044.33
|
Total | 372 1,041 1,413
| 74743.46774 39729.70893 48947.76858
| 74743.47 39729.71 48947.77
----------------------------------------------------
* Notice that the predicted values and the actual values in our 3x2=6 cells do not coincide. That is because our model had only 4 terms, and cannot fit the 6 cells exactly. Another way to think about this is that the 3 occupations have different gender income gaps, but our model above allowed for only 1 general gender income gap. If we want to fit all 6 cells exactly, we need to allow the gender gap to vary across occupations.
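* To see where the predicted values in the table come from (a sketch that was not part of the class session; the scalar names are made up for illustration), each cell's prediction is just a sum of the relevant coefficients from the 4-term model:

```stata
* Coefficients copied from the additive desmat model above
scalar b_cons   = 53448.743
scalar b_soc    = -604.948
scalar b_law    = 25723.531
scalar b_female = -17003.194
* Reference cell (male nurses) is the constant; other cells add the relevant terms
display "male nurses:         " b_cons
display "female nurses:       " b_cons + b_female
display "male sociologists:   " b_cons + b_soc
display "female sociologists: " b_cons + b_soc + b_female
display "male lawyers:        " b_cons + b_law
display "female lawyers:      " b_cons + b_law + b_female
```

* Every female cell sits exactly b_female below the matching male cell, so the model forces the same gender gap of about $17,000 on all three occupations.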
. desmat: regress incwage occ1990_reduced*sex
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 1413
F statistic: 50.578
Model degrees of freedom: 5
Residual degrees of freedom: 1407
R-squared: 0.152
Adjusted R-squared: 0.149
Root MSE 42237.424
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
occ1990_reduced
1 sociologists -9402.452 30344.261
2 lawyers 31633.970** 5879.320
sex
3 Female -11824.524* 5545.056
occ1990_reduced.sex
4 sociologists.Female 15287.024 36996.590
5 lawyers.Female -8707.162 7067.771
6 _cons 48602.452** 5364.158
---------------------------------------------------------------------------------
* p < .05
** p < .01
* Our 2 new terms are not significant, and the adjusted R-squared does not improve, but the model now has 6 terms and fits all 6 cell means exactly.
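* desmat is a user-written add-on. As a sketch (not run in class), the same two models can be written with Stata's built-in factor-variable notation, assuming Stata 11 or later and assuming that nurses and Male carry the lowest codes of occupation and sex, so that they serve as the reference categories just as in the desmat output:

```stata
* Additive model: occupation and sex as categorical predictors
regress incwage i.occ1990_reduced i.sex
* Interaction model: ## includes the main effects plus all occupation-by-sex interactions
regress incwage i.occ1990_reduced##i.sex
* Joint test of the two interaction terms
testparm i.occ1990_reduced#i.sex
```

* The testparm line asks whether the two interaction terms together improve the fit, which is the question the individual significance stars can only answer one term at a time.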
. predict m2
(option xb assumed; fitted values)
(132297 missing values generated)
. table occ1990_reduced sex, contents(freq mean incwage mean m1_oc_gen mean m2) row col
----------------------------------------------------
occ1990_redu | Sex
ced | Male Female Total
-------------+--------------------------------------
nurses | 62 904 966
| 48602.45161 36777.9281 37536.85197
| 53448.74 36445.55 37536.85
| 48602.45 36777.93 37536.86
|
sociologists | 2 4 6
| 39200 42662.5 41508.33333
| 52843.8 35840.6 41508.33
| 39200 42662.5 41508.33
|
lawyers | 308 133 441
| 80236.42208 59704.73684 74044.32653
| 79172.27 62169.08 74044.33
| 80236.42 59704.74 74044.33
|
Total | 372 1,041 1,413
| 74743.46774 39729.70893 48947.76858
| 74743.47 39729.71 48947.77
| 74743.47 39729.71 48947.77
----------------------------------------------------
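* The exact fit of the interaction model can be checked by hand (a sketch that was not part of the class session; the scalar names are made up for illustration). With 6 terms for 6 cells, each sum of coefficients reproduces the observed cell mean:

```stata
* Coefficients copied from the desmat interaction model above
scalar c_cons   = 48602.452
scalar c_soc    = -9402.452
scalar c_law    = 31633.970
scalar c_female = -11824.524
scalar c_soc_f  = 15287.024
scalar c_law_f  = -8707.162
* Female cells add the general female term plus the occupation-specific interaction
display "male nurses:         " c_cons
display "female nurses:       " c_cons + c_female
display "male sociologists:   " c_cons + c_soc
display "female sociologists: " c_cons + c_soc + c_female + c_soc_f
display "male lawyers:        " c_cons + c_law
display "female lawyers:      " c_cons + c_law + c_female + c_law_f
```

* The gender gap now differs by occupation: c_female alone for nurses, c_female + c_soc_f for sociologists, and c_female + c_law_f for lawyers.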
. log close
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj
> 3\fall_2010_s381_logs\class8.log
log type: text
closed on: 14 Oct 2010, 16:00:31
---------------------------------------------------------------------------------------------------