soc381 log 4, fall 2011

---------------------------------------------------------------------------------

name: <unnamed>

log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2011_381_logs\class4.log

log type: text

opened on: 6 Oct 2011, 13:32:16

. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear

. table sex if age >=25 & age <=34, contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.31212 2.967666 .0312351

Female | 9,511 13.55657 2.854472 .0292693

--------------------------------------------------------------

* By the way, just to remind ourselves: the standard error of the mean is just sd/sqrt(N).

. display 2.967666/(sqrt(9027))

.03123513

. ttest yrsed if age >=25 & age<=34, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 13.31212 .0312351 2.967666 13.25089 13.37335

Female | 9511 13.55657 .0292693 2.854472 13.49919 13.61394

---------+--------------------------------------------------------------------

combined | 18538 13.43753 .0213921 2.912627 13.3956 13.47946

---------+--------------------------------------------------------------------

diff | -.2444469 .0427623 -.3282649 -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

* Note, in small letters, what Ho, the null hypothesis, is: that there is no difference between men's and women's educational attainment, i.e. that the real difference in educational attainment between women and men in this age group in the US is zero. The t-statistic of 5.7164, or -5.7164, helps us figure out how unlikely we would be to find the difference that we actually have in our sample if there were no difference in the US at all. That is, how likely would we be to find a difference this big (0.244 years) just by chance?
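
* As a check on where that t-statistic comes from, here is a minimal sketch that rebuilds it by hand from the group sizes, means, and standard deviations in the table above, using the pooled (equal-variance) standard error. The scalar names (n1, s1, sp, se_diff, and so on) are illustrative labels only, not anything from the dataset. The last line uses Stata's immediate command ttesti, which runs the same test from summary statistics alone.

```stata
* Summary statistics copied from the ttest output above
scalar n1 = 9027
scalar m1 = 13.31212
scalar s1 = 2.967666
scalar n2 = 9511
scalar m2 = 13.55657
scalar s2 = 2.854472

* Pooled standard deviation under the equal-variance assumption
scalar sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2))

* Standard error of the difference in means, then the t-statistic
scalar se_diff = sp*sqrt(1/n1 + 1/n2)
display (m1 - m2)/se_diff

* Same test from summary statistics, without the raw data
ttesti 9027 13.31212 2.967666 9511 13.55657 2.854472
```

* The displayed ratio should agree with the t reported by ttest, up to rounding of the inputs.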

. display ttail(18000, -5.7164)

.99999999

* The syntax for ttail is ttail(df,T)= right hand tail probability. In order to make sense of the functions and their syntax, you have to look them up in Stata's online help.

. display 1-ttail(18000, -5.7164)

5.526e-09

* This tells us what the left hand tail probability is below T=-5.7164

. display ttail(18000, 5.7164)

5.526e-09

* Which is the same as the right hand cumulative probability above T=5.7164

. display 2*ttail(18000, 5.7164)

1.105e-08

* Since we generally want to do two-tailed tests, because we were not making a prior assumption about which should be larger, women's or men's educational attainment, we want to add the probabilities of both tails.
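
* An aside: the commands above round the degrees of freedom to 18000, which makes almost no difference with a sample this large. Here is a sketch of the same two-tailed calculation using the results that ttest leaves behind in r() (r(t), r(df_t), and r(p)), so nothing has to be retyped. The scalar names t_obs, df_obs, and p_two are made up for illustration.

```stata
quietly ttest yrsed if age>=25 & age<=34, by(sex)

* Save the stored results before another command overwrites r()
scalar t_obs  = r(t)
scalar df_obs = r(df_t)
scalar p_two  = r(p)

* Two-tailed p-value: twice the right-hand tail beyond |t|
display 2*ttail(df_obs, abs(t_obs))

* Stata's own two-sided p-value, for comparison
display p_two
```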

. drop months_ed

. gen months_ed=yrsed*12

(30484 missing values generated)

* What if we re-scaled education to months rather than years? How would that change our statistical analysis? It turns out that the T-statistic is unit-free (it has the units of X in both the numerator and the denominator, so they cancel), so the T-statistic, and therefore the significance of the test, is not affected by rescaling.

. ttest months_ed if age >=25 & age<=34, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 159.7454 .3748215 35.61199 159.0107 160.4802

Female | 9511 162.6788 .3512319 34.25366 161.9903 163.3673

---------+--------------------------------------------------------------------

combined | 18538 161.2504 .2567052 34.95152 160.7472 161.7536

---------+--------------------------------------------------------------------

diff | -2.933363 .5131471 -3.939178 -1.927547

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
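
* To see the cancellation directly: multiplying education by 12 multiplies both the difference in means and its standard error by 12, so their ratio is unchanged. A quick check, using only the numbers printed in the two ttest tables (years and months):

```stata
* t = difference / standard error, in years
display -.2444469/.0427623

* Same ratio in months: numerator and denominator are both 12 times larger
display -2.933363/.5131471

* The rescaling factor itself (months difference over years difference)
display 2.933363/.2444469
```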

. regress yrsed male if age>=25 & age<=34

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 32.68

Model | 276.742433 1 276.742433 Prob > F = 0.0000

Residual | 156979.922 18536 8.46892111 R-squared = 0.0018

-------------+------------------------------ Adj R-squared = 0.0017

Total | 157256.664 18537 8.48339343 Root MSE = 2.9101

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2444469 .0427623 -5.72 0.000 -.3282649 -.1606289

_cons | 13.55657 .0298401 454.31 0.000 13.49808 13.61506

------------------------------------------------------------------------------

* Note by the way, that the constant here is the mean of women's education, and the "male" term is just the difference between men and women. Both terms have substantive meaning. But also notice the giant T-statistic associated with the constant. What is the null hypothesis of that test? It must be a real doozy, because that null hypothesis is totally rejected. In fact the probability associated with that T-statistic is too small for Stata to display. What is this second null hypothesis? Remember that the null hypotheses are generally centered around zero, which is why the standard Normal and T distribution tables are centered around zero. The Null hypothesis for the constant is that the constant term is zero, meaning the null hypothesis is that women's average education in the US is zero. That null hypothesis is totally nonsensical, and thankfully easy to reject. Not all null hypotheses represent sensible possibilities.
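
* The t-statistic in the regression table is just the coefficient divided by its standard error, which is the test against a null value of zero. Here is a sketch of that arithmetic, followed by a test of the constant against a non-zero null. The value 12 (a high school diploma) is only an illustrative assumption, not something from the class. The test command has to be run right after the regression it refers to.

```stata
* t for the constant: coefficient / standard error, null value = 0
display 13.55657/.0298401

* Testing the constant against a hypothetical null of 12 years instead
quietly regress yrsed male if age>=25 & age<=34
test _cons = 12
```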

. display 2*(ttail(18000,454))

. display 2*ttail(18000, 5.7164)

1.105e-08

. display normal(5.7164)

.99999999

. display 2*(1-normal(5.7164))

1.088e-08

. display invnormal(1-.025)

1.959964

. display invttail(5, .025)

2.5705818

. display invttail(100, .025)

1.9839715

. display invttail(18000, .025)

1.9600958

. regress yrsed male if age>=25 & age<=34

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 32.68

Model | 276.742433 1 276.742433 Prob > F = 0.0000

Residual | 156979.922 18536 8.46892111 R-squared = 0.0018

-------------+------------------------------ Adj R-squared = 0.0017

Total | 157256.664 18537 8.48339343 Root MSE = 2.9101

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2444469 .0427623 -5.72 0.000 -.3282649 -.1606289

_cons | 13.55657 .0298401 454.31 0.000 13.49808 13.61506

------------------------------------------------------------------------------

* What if we weight by aweights, which rescales the weights to average 1? We get a slightly different average and sd, but the same N.

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed)

-------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed)

----------+--------------------------------------

Male | 9,027 13.5574 2.819247

Female | 9,511 13.76295 2.720855

-------------------------------------------------

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.5574 2.819247 .029673

Female | 9,511 13.76295 2.720855 .0278992

--------------------------------------------------------------

. table sex if age >=25 & age <=34, contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.31212 2.967666 .0312351

Female | 9,511 13.55657 2.854472 .0292693

--------------------------------------------------------------

. regress yrsed male [aweight=perwt_rounded] if age>=25 & age<=34

(sum of wgt is 3.7786e+07)

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 25.52

Model | 195.741395 1 195.741395 Prob > F = 0.0000

Residual | 142186.809 18536 7.67084641 R-squared = 0.0014

-------------+------------------------------ Adj R-squared = 0.0013

Total | 142382.551 18537 7.6809921 Root MSE = 2.7696

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2055446 .0406899 -5.05 0.000 -.2853005 -.1257887

_cons | 13.76294 .0285199 482.57 0.000 13.70704 13.81885

------------------------------------------------------------------------------

* And aweight gives us a result similar to the one we got without weights (T-statistic about 5).

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.5574 2.819247 .029673

Female | 9,511 13.76295 2.720855 .0278992

--------------------------------------------------------------

. table sex if age >=25 & age <=34 [fweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 1.86e+07 13.5574 2.819091 .0006543

Female | 1.92e+07 13.76295 2.720712 .0006205

--------------------------------------------------------------

* Using fweight gives a dramatically larger N, but essentially the same average and sd as using aweight (since the average and sd are not functions of N). N goes up by a factor of 2000 or so, and semean decreases by a factor of about sqrt(2000), or roughly 45.

. regress yrsed male [fweight=perwt_rounded] if age>=25 & age<=34

Source | SS df MS Number of obs =37785945

-------------+------------------------------ F( 1,37785943) =52018.00

Model | 398979.047 1 398979.047 Prob > F = 0.0000

Residual | 289818910 37785943 7.67001924 R-squared = 0.0014

-------------+------------------------------ Adj R-squared = 0.0014

Total | 290217889 37785944 7.68057796 Root MSE = 2.7695

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2055446 .0009012 -228.07 0.000 -.2073109 -.2037782

_cons | 13.76294 .0006317 2.2e+04 0.000 13.76171 13.76418

------------------------------------------------------------------------------

* What is the result of doing the regression with fweights? The T-statistic is inflated by a factor of about sqrt(2000), or about 45. This is an easy mistake to make, but a bad one. We don't really have 37 million cases in our sample; we have 18 thousand. In this particular case, with this dataset and these weights, the fweighted regression yields a misleading answer.
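
* The size of the inflation can be checked from the numbers already printed in the aweighted and fweighted regression output. A minimal sketch, using only values copied from those tables:

```stata
* How many times larger is the fweighted N than the real sample?
display 37785945/18538

* Standard errors shrink (and t grows) by roughly the square root of that
display sqrt(37785945/18538)

* Observed ratio of the fweighted to the aweighted t-statistic for male
display 228.07/5.05

* Observed ratio of the aweighted to the fweighted standard error for male
display .0406899/.0009012
```

* The last three numbers should all come out close to one another, which is the sqrt(N) inflation at work.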

. log close

name: <unnamed>

log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2011_381_logs\class4.log

log type: text

closed on: 6 Oct 2011, 15:23:45

--------------------------------------------------------------------------------