----------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3
> \2011_180B_logs\class4.log
log type: text
opened on: 3 Feb 2011, 13:22:48
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", clear
* a familiar t-test:
. ttest yrsed if age >24 & age <35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
* A quick review of what the t-statistic of 5.716 (in absolute value) means:
* normal() below gives the cumulative probability of the standard Normal distribution up to 5.716.
. display normal(5.716)
.99999999
* Below, take the upper-tail probability (1 minus the cumulative probability above) and double it to get the two-tailed probability.
. display 2*(1-normal(5.716))
1.091e-08
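* A sketch, not run in this session (so no output is shown): the same two-tailed probability taken from the t distribution itself rather than from the Normal approximation. ttail(df, t) returns the upper-tail probability; with 18536 degrees of freedom the answer should be nearly identical to the one above.
. display 2*ttail(18536, 5.7164)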
* invnormal() takes a cumulative probability and returns the corresponding Z-score, i.e. a value on the x axis. As a reminder, the critical value of the Normal distribution that leaves 2.5% in the upper tail, and therefore 5% in the two tails combined, is about 1.96.
. display invnormal(1-.025)
1.959964
* invttail(df, p) returns the t value (the analogue of a Z-score) that leaves probability p in the upper tail of a t distribution with df degrees of freedom. Note that as df goes up, the value approaches 1.96, i.e. the t distribution becomes almost exactly like the Normal distribution. You can use Stata's online help to pull up the definitions of functions like these. Try it!
. display invttail(20,.025)
2.0859634
. display invttail(100,.025)
1.9839715
. display invttail(20000, .025)
1.9600826
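* A sketch, not run in this session (so no output is shown): rebuilding the 95% confidence interval for the difference in means from the t-test above, as the estimate plus or minus the critical value times the standard error. The two results should match the reported bounds of -.3282649 and -.1606289 up to rounding.
. display -.2444469 - invttail(18536, .025)*.0427623
. display -.2444469 + invttail(18536, .025)*.0427623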
. table sex if age>24 & age<35, contents(freq mean yrsed sd yrsed)
-------------------------------------------------
Sex       |       Freq.  mean(yrsed)    sd(yrsed)
----------+--------------------------------------
Male      |       9,027     13.31212     2.967666
Female    |       9,511     13.55657     2.854472
-------------------------------------------------
* First without weights (above), then with frequency weights (below). Note that with fweights the sample size is inflated by a factor of about 2000, while the mean and sd change only a little.
. table sex if age>24 & age<35 [fweight= perwt_rounded], contents(freq mean yrsed sd yrsed)
-------------------------------------------------
Sex       |       Freq.  mean(yrsed)    sd(yrsed)
----------+--------------------------------------
Male      |    1.86e+07      13.5574     2.819091
Female    |    1.92e+07     13.76295     2.720712
-------------------------------------------------
* aweights, below, use the same weights (so the means are the same as with fweights), but Stata rescales them to average one, so the reported sample size equals the actual unweighted sample size. That matters because standard errors depend on the true number of observations.
. table sex if age>24 & age<35 [aweight= perwt_rounded], contents(freq mean yrsed sd yrsed)
-------------------------------------------------
Sex       |       Freq.  mean(yrsed)    sd(yrsed)
----------+--------------------------------------
Male      |       9,027      13.5574     2.819247
Female    |       9,511     13.76295     2.720855
-------------------------------------------------
* Generating a new 0-1 dummy variable for female, so that we can run regressions with gender as a predictor.
. tabulate sex
        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |     64,791       48.46       48.46
     Female |     68,919       51.54      100.00
------------+-----------------------------------
      Total |    133,710      100.00
. tabulate sex, nolab
        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     64,791       48.46       48.46
          2 |     68,919       51.54      100.00
------------+-----------------------------------
      Total |    133,710      100.00
. gen female=0
. replace female=1 if sex==2
(68919 real changes made)
.
. label define female_lbl 0 "male" 1 "female"
. label val female female_lbl
. tabulate sex female
           |        female
       Sex |      male     female |     Total
-----------+----------------------+----------
      Male |    64,791          0 |    64,791
    Female |         0     68,919 |    68,919
-----------+----------------------+----------
     Total |    64,791     68,919 |   133,710
. tabulate sex female, nolab
           |        female
       Sex |         0          1 |     Total
-----------+----------------------+----------
         1 |    64,791          0 |    64,791
         2 |         0     68,919 |    68,919
-----------+----------------------+----------
     Total |    64,791     68,919 |   133,710
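* A sketch, not run in this session (so no output is shown): the same dummy can be built in one line, because a logical expression evaluates to 1 when true and 0 when false. The name female2 is hypothetical. The if qualifier is an added precaution that leaves female2 missing wherever sex is missing (the tabulations above show no such cases).
. gen female2 = (sex==2) if !missing(sex)
. tabulate female female2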
. ttest yrsed if age >24 & age <35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. regress yrsed female if age>24 & age<35
      Source |       SS       df       MS              Number of obs =   18538
-------------+------------------------------           F(  1, 18536) =   32.68
       Model |  276.742433     1  276.742433           Prob > F      =  0.0000
    Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018
-------------+------------------------------           Adj R-squared =  0.0017
       Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101
------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .2444469   .0427623     5.72   0.000     .1606289    .3282649
       _cons |   13.31212   .0306297   434.62   0.000     13.25208    13.37216
------------------------------------------------------------------------------
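* Note that the regression above, a simple classic OLS regression with gender predicting yrsed, gives the same standard error and the same t-statistic (in absolute value) as the equal-variance t-test above. The coefficient on female is the same difference in means with the sign flipped, because ttest reports male minus female while the regression reports female minus male.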
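* A sketch, not run in this session (so no output is shown): why the t-test and the regression agree. The equal-variance t-test pools the two group variances, and the pooled standard deviation is the regression's Root MSE. The group sizes and standard deviations are copied from the t-test table above.
. display sqrt(((9027-1)*2.967666^2 + (9511-1)*2.854472^2)/(9027+9511-2))
. display sqrt(((9027-1)*2.967666^2 + (9511-1)*2.854472^2)/(9027+9511-2)) * sqrt(1/9027 + 1/9511)
* The first result should be close to the Root MSE of 2.9101, and the second close to the standard error of .0427623.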
. regress yrsed female [aweight= perwt_rounded] if age>24 & age<35
(sum of wgt is 3.7786e+07)
      Source |       SS       df       MS              Number of obs =   18538
-------------+------------------------------           F(  1, 18536) =   25.52
       Model |  195.741395     1  195.741395           Prob > F      =  0.0000
    Residual |  142186.809 18536  7.67084641           R-squared     =  0.0014
-------------+------------------------------           Adj R-squared =  0.0013
       Total |  142382.551 18537   7.6809921           Root MSE      =  2.7696
------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .2055446   .0406899     5.05   0.000     .1257887    .2853005
       _cons |    13.5574   .0290221   467.14   0.000     13.50051    13.61429
------------------------------------------------------------------------------
* The aweighted regression gives a similar coefficient (.206 versus .244) and t-statistic (5.05 versus 5.72), but it may be preferable to the unweighted regression because the weights adjust for the fact that the sample does not represent all groups in proportion to their share of the population.
. regress yrsed female [fweight= perwt_rounded] if age>24 & age<35
      Source |       SS           df       MS          Number of obs  = 37785945
-------------+----------------------------------       F(1, 37785943) = 52018.00
       Model |  398979.047         1  398979.047       Prob > F       =   0.0000
    Residual |   289818910  37785943  7.67001924       R-squared      =   0.0014
-------------+----------------------------------       Adj R-squared  =   0.0014
       Total |   290217889  37785944  7.68057796       Root MSE       =   2.7695
------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .2055446   .0009012   228.07   0.000     .2037782    .2073109
       _cons |    13.5574   .0006428  2.1e+04   0.000     13.55614    13.55866
------------------------------------------------------------------------------
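* The fweighted regression treats each weight as a count of real observations, which unfairly inflates the sample size by a factor of about 2000 (37785945 versus 18538). That shrinks the standard error by a factor of sqrt(2000), or about 45, and inflates the t-statistic by the same factor (228.07 versus 5.05).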
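* A sketch, not run in this session (so no output is shown): checking that arithmetic with the sample sizes, standard errors, and t-statistics reported in the aweighted and fweighted regressions above. The three results should be close to one another.
. display sqrt(37785945/18538)
. display .0406899/.0009012
. display 228.07/5.05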
. table occ1990 if occ1990==178| occ1990==95| occ1990==125, contents(freq mean incwage p25 incwage p75 incwage)
----------------------------------------------------------------------------------
Occupation, 1990      |
basis                 |      Freq.   mean(incwage)    p25(incwage)    p75(incwage)
----------------------+-----------------------------------------------------------
Registered nurses     |        966     37536.85197           25000           48000
Sociology instructors |          6     41508.33333           35000           46000
Lawyers               |        441     74044.32653           17000          100960
----------------------------------------------------------------------------------
. ttest incwage if occ1990==178| occ1990==95, by(occ1990)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83
 Lawyers |     441    74044.33    3287.284    69032.96     67583.6    80505.06
---------+--------------------------------------------------------------------
combined |    1407    48979.49    1223.363    45888.34    46579.68    51379.31
---------+--------------------------------------------------------------------
    diff |           -36507.47    2451.758               -41316.97   -31697.97
------------------------------------------------------------------------------
    diff = mean(Register) - mean(Lawyers)                         t = -14.8903
Ho: diff = 0                                     degrees of freedom =     1405
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
* above: how to run a t-test on the occupational comparisons.
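* A sketch, not run in this session (so no output is shown): the two groups above have very different standard deviations (21839.96 versus 69032.96), so the equal-variance assumption is questionable here. ttest's unequal option drops that assumption; whether it changes the conclusion in this case is left to check.
. ttest incwage if occ1990==178| occ1990==95, by(occ1990) unequal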
. graph box incwage if occ1990==178| occ1990==95| occ1990==125, over(occ1990)
* above: how to generate a box plot.
. gen months_ed=yrsed*12
(30484 missing values generated)
* If we rescale yrsed by multiplying by 12, would our results change?
. ttest yrsed if age >24 & age <35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. ttest months_ed if age >24 & age <35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    159.7454    .3748215    35.61199    159.0107    160.4802
  Female |    9511    162.6788    .3512319    34.25366    161.9903    163.3673
---------+--------------------------------------------------------------------
combined |   18538    161.2504    .2567052    34.95152    160.7472    161.7536
---------+--------------------------------------------------------------------
    diff |           -2.933363    .5131471               -3.939178   -1.927547
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
* Answer: the units of educational attainment change (the means, standard errors, and difference are all multiplied by 12), but the t-statistic does not change; it is unit-free.
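* A sketch, not run in this session (so no output is shown): multiplying the outcome by 12 multiplies both the difference in means and its standard error by 12, so their ratio, the t-statistic, is unchanged. The figures come from the yrsed t-test above; the first two results should match the months_ed difference and standard error up to rounding, and the third should be close to -5.7164.
. display 12*(-.2444469)
. display 12*.0427623
. display (12*(-.2444469))/(12*.0427623)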
. gen lawyers=0
. replace lawyers=1 if occ1990==178
(441 real changes made)
. gen nurses=0
. replace nurses=1 if occ1990==95
(966 real changes made)
. ttest incwage if occ1990==178| occ1990==95, by(occ1990)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83
 Lawyers |     441    74044.33    3287.284    69032.96     67583.6    80505.06
---------+--------------------------------------------------------------------
combined |    1407    48979.49    1223.363    45888.34    46579.68    51379.31
---------+--------------------------------------------------------------------
    diff |           -36507.47    2451.758               -41316.97   -31697.97
------------------------------------------------------------------------------
    diff = mean(Register) - mean(Lawyers)                         t = -14.8903
Ho: diff = 0                                     degrees of freedom =     1405
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. regress incwage lawyers if occ1990==178| occ1990==95
      Source |       SS       df       MS              Number of obs =    1407
-------------+------------------------------           F(  1,  1405) =  221.72
       Model |  4.0354e+11     1  4.0354e+11           Prob > F      =  0.0000
    Residual |  2.5571e+12  1405  1.8200e+09           R-squared     =  0.1363
-------------+------------------------------           Adj R-squared =  0.1357
       Total |  2.9607e+12  1406  2.1057e+09           Root MSE      =   42662
------------------------------------------------------------------------------
     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lawyers |   36507.47   2451.758    14.89   0.000     31697.97    41316.97
       _cons |   37536.85   1372.618    27.35   0.000     34844.25    40229.45
------------------------------------------------------------------------------
* Would it matter if we compared nurses to lawyers instead of the other way around? No. The coefficient and the t-statistic change sign, and the constant becomes the mean of the other group (lawyers instead of nurses), but the size of the difference, its standard error, and the fit of the model are identical. Since the t distribution is symmetric, and since we almost always use two-tailed tests, positive and negative coefficients of the same size carry the same evidence.
. regress incwage nurses if occ1990==178| occ1990==95
      Source |       SS       df       MS              Number of obs =    1407
-------------+------------------------------           F(  1,  1405) =  221.72
       Model |  4.0354e+11     1  4.0354e+11           Prob > F      =  0.0000
    Residual |  2.5571e+12  1405  1.8200e+09           R-squared     =  0.1363
-------------+------------------------------           Adj R-squared =  0.1357
       Total |  2.9607e+12  1406  2.1057e+09           Root MSE      =   42662
------------------------------------------------------------------------------
     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nurses |  -36507.47   2451.758   -14.89   0.000    -41316.97   -31697.97
       _cons |   74044.33    2031.51    36.45   0.000     70059.21    78029.45
------------------------------------------------------------------------------
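* A sketch, not run in this session (so no output is shown): with a single 0-1 predictor, the constant is the mean of the group coded 0, and the constant plus the coefficient is the mean of the group coded 1. Using the coefficients from the two regressions above, these lines should reproduce the group means from the t-test (74044.33 for lawyers, 37536.85 for nurses) up to rounding.
. display 37536.85 + 36507.47
. display 74044.33 - 36507.47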
. log close
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\so
> c_meth_proj3\2011_180B_logs\class4.log
log type: text
closed on: 3 Feb 2011, 15:29:11
----------------------------------------------------------------------------------------