---------------------------------------------------------------------------------

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2011_381_logs\class4.log

  log type:  text

 opened on:   6 Oct 2011, 13:32:16

 

. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear

 

 

. table sex if age >=25 & age <=34, contents (freq mean yrsed sd yrsed semean yrsed)

 

--------------------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

     Male |       9,027     13.31212     2.967666     .0312351

   Female |       9,511     13.55657     2.854472     .0292693

--------------------------------------------------------------

* By the way, just to remind ourselves: the standard error of the mean is just sd/sqrt(N).

 

. display 2.967666/(sqrt(9027))

.03123513
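
* The same check can be sketched outside Stata. This is a minimal Python illustration, not part of the class session; the function name standard_error is made up for the example, and the sd and N values are copied from the table above.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Summary statistics copied from the table above (ages 25-34).
se_male = standard_error(2.967666, 9027)
se_female = standard_error(2.854472, 9511)

print(f"male   sem = {se_male:.7f}")
print(f"female sem = {se_female:.7f}")
```

* Both values should agree with the sem(yrsed) column of the table to about seven decimal places.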

 

. ttest yrsed if age >=25 & age<=34, by(sex)

 

Two-sample t test with equal variances

------------------------------------------------------------------------------

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

    diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

    diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

 

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

 

* Note, in small letters, what Ho, the null hypothesis, is: that there is no difference between men's and women's educational attainment, i.e. that the real difference in the US in educational attainment between men and women in this age group is zero. The t-statistic of -5.7164 (5.7164 in absolute value) helps us figure out how unlikely it would be to find the difference we actually observe in our sample if there were no difference in the US at all. That is, how likely would we be to find a difference this big (0.244 years) just by chance?
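
* To see where the t-statistic comes from, here is a rough Python sketch of the equal-variance (pooled) two-sample t test, computed only from the summary statistics in the ttest table. The function name pooled_t is an assumption made for this illustration; it is not a Stata command.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se_diff
    return t, se_diff, df

# Male and female values copied from the ttest output above.
t, se_diff, df = pooled_t(13.31212, 2.967666, 9027,
                          13.55657, 2.854472, 9511)

print(f"se(diff) = {se_diff:.7f}")
print(f"t        = {t:.4f}")
print(f"df       = {df}")
```

* Because the inputs are rounded to the digits Stata prints, the result matches the ttest output only up to rounding.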

 

 

. display ttail(18000, -5.7164)

.99999999

 

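* The syntax for ttail is ttail(df, t), which returns the right-hand tail probability, Pr(T > t). That is why ttail(18000, -5.7164) is almost exactly 1. (The actual degrees of freedom are 18536; with df this large, using 18000 makes no practical difference.) To make sense of the functions and their syntax, you have to look them up in Stata's online help.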

 

. display 1-ttail(18000, -5.7164)

5.526e-09

 

* This tells us what the left hand tail probability is below T=-5.7164

 

. display ttail(18000, 5.7164)

5.526e-09

 

* Which is the same as the right-hand tail probability above T=5.7164, because the t distribution is symmetric around zero.

 

. display 2*ttail(18000, 5.7164)

1.105e-08

 

* We generally want a two-tailed test, because we made no prior assumption about whether women's or men's educational attainment should be larger, so we add the probabilities of both tails.

 

. drop months_ed

 

. gen months_ed=yrsed*12

(30484 missing values generated)

 

* What if we re-scaled education to months rather than years? How would that change our statistical analysis? It turns out that the T-statistic is unit-free (it has the units of X in both the numerator and the denominator, so they cancel), so the T-statistic, and therefore the significance of the test, is not affected by rescaling.
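
* The cancellation of units can be checked numerically. This is a small Python sketch under the assumption that rescaling to months multiplies every mean and sd by 12 and leaves the Ns alone; pooled_t is a hypothetical helper written for this example.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff

scale = 12  # months per year

# Years of education: values copied from the ttest output for yrsed.
t_years = pooled_t(13.31212, 2.967666, 9027, 13.55657, 2.854472, 9511)

# Months of education: means and sds scale by 12, Ns do not change.
t_months = pooled_t(scale * 13.31212, scale * 2.967666, 9027,
                    scale * 13.55657, scale * 2.854472, 9511)

print(f"t in years  = {t_years:.4f}")
print(f"t in months = {t_months:.4f}")
```

* The factor of 12 appears in both the difference in means and its standard error, so the ratio is unchanged.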

 

. ttest months_ed if age >=25 & age<=34, by(sex)

 

Two-sample t test with equal variances

------------------------------------------------------------------------------

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

    Male |    9027    159.7454    .3748215    35.61199    159.0107    160.4802

  Female |    9511    162.6788    .3512319    34.25366    161.9903    163.3673

---------+--------------------------------------------------------------------

combined |   18538    161.2504    .2567052    34.95152    160.7472    161.7536

---------+--------------------------------------------------------------------

    diff |           -2.933363    .5131471               -3.939178   -1.927547

------------------------------------------------------------------------------

    diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

 

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

 

. regress yrsed male if age>=25 & age<=34

 

      Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

       Model |  276.742433     1  276.742433           Prob > F      =  0.0000

    Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018

-------------+------------------------------           Adj R-squared =  0.0017

       Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

 

------------------------------------------------------------------------------

       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

        male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289

       _cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506

------------------------------------------------------------------------------

 

* Note by the way, that the constant here is the mean of women's education, and the "male" term is just the difference between men and women. Both terms have substantive meaning. But also notice the giant T-statistic associated with the constant. What is the null hypothesis of that test? It must be a real doozy, because that null hypothesis is totally rejected. In fact the probability associated with that T-statistic is too small for Stata to display. What is this second null hypothesis? Remember that the null hypotheses are generally centered around zero, which is why the standard Normal and T distribution tables are centered around zero. The Null hypothesis for the constant is that the constant term is zero, meaning the null hypothesis is that women's average education in the US is zero. That null hypothesis is totally nonsensical, and thankfully easy to reject. Not all null hypotheses represent sensible possibilities.
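
* As a rough check on that giant T-statistic, here is a short Python sketch using numbers copied from the regression output. It relies on a standard result for a regression with a single 0/1 predictor: the standard error of the constant is the root mean squared error divided by the square root of the number of cases in the reference group (here, the 9,511 women). The variable names are made up for the illustration.

```python
import math

# Values copied from the regression output above.
cons_coef = 13.55657       # _cons: mean years of education for women
cons_se = 0.0298401        # reported standard error of _cons
residual_ms = 8.46892111   # Residual MS; its square root is the Root MSE
n_female = 9511            # women aged 25-34, from the ttest table

# t statistic for the null hypothesis that the constant is zero.
t_cons = cons_coef / cons_se

# Standard error of the constant rebuilt from Root MSE and group size.
se_rebuilt = math.sqrt(residual_ms) / math.sqrt(n_female)

print(f"t for _cons      = {t_cons:.2f}")
print(f"rebuilt se(_cons) = {se_rebuilt:.7f}")
```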

 

. display 2*(ttail(18000,454))

0

 

. display 2*ttail(18000, 5.7164)

1.105e-08

 

. display normal(5.7164)

.99999999

 

. display 2*(1-normal(5.7164))

1.088e-08
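
* The Normal two-tailed probability above can be reproduced with Python's standard library. This is only a sketch: it uses the Normal distribution, not the t distribution, so it corresponds to 2*(1-normal(5.7164)) rather than to 2*ttail(18000, 5.7164). The function name is an assumption for the example.

```python
import math

def two_tailed_p_normal(z):
    """Two-tailed p-value under the standard Normal: Pr(|Z| > |z|)."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and avoids subtracting
    # two numbers that are both very close to 1.
    return math.erfc(abs(z) / math.sqrt(2))

p_normal = two_tailed_p_normal(-5.7164)
print(f"two-tailed p (Normal) = {p_normal:.3e}")
```

* With roughly 18,000 degrees of freedom the t distribution is nearly Normal, which is why this value is so close to the 1.105e-08 from ttail.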

 

. display invnormal(1-.025)

1.959964

 

. display invttail(5, .025)

2.5705818

 

. display invttail(100, .025)

1.9839715

 

. display invttail(18000, .025)

1.9600958

 

 

* What if we weight by aweights, which rescales the weights to average 1? We get a slightly different mean and sd, but the same N.

 

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed)

 

-------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)

----------+--------------------------------------

     Male |       9,027      13.5574     2.819247

   Female |       9,511     13.76295     2.720855

-------------------------------------------------

 

 

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

 

--------------------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

     Male |       9,027      13.5574     2.819247      .029673

   Female |       9,511     13.76295     2.720855     .0278992

--------------------------------------------------------------

 

. table sex if age >=25 & age <=34, contents (freq mean yrsed sd yrsed semean yrsed)

 

--------------------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

     Male |       9,027     13.31212     2.967666     .0312351

   Female |       9,511     13.55657     2.854472     .0292693

--------------------------------------------------------------

 

 

. regress yrsed male [aweight=perwt_rounded] if age>=25 & age<=34

(sum of wgt is   3.7786e+07)

 

      Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   25.52

       Model |  195.741395     1  195.741395           Prob > F      =  0.0000

    Residual |  142186.809 18536  7.67084641           R-squared     =  0.0014

-------------+------------------------------           Adj R-squared =  0.0013

       Total |  142382.551 18537   7.6809921           Root MSE      =  2.7696

 

------------------------------------------------------------------------------

       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

        male |  -.2055446   .0406899    -5.05   0.000    -.2853005   -.1257887

       _cons |   13.76294   .0285199   482.57   0.000     13.70704    13.81885

------------------------------------------------------------------------------

 

* And aweight gives us a result similar to the one we got without weights (a T statistic of about 5, compared to 5.7 unweighted).

 

. table sex if age >=25 & age <=34 [aweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

 

--------------------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

     Male |       9,027      13.5574     2.819247      .029673

   Female |       9,511     13.76295     2.720855     .0278992

--------------------------------------------------------------

 

. table sex if age >=25 & age <=34 [fweight=perwt_rounded], contents (freq mean yrsed sd yrsed semean yrsed)

 

--------------------------------------------------------------

      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

     Male |    1.86e+07      13.5574     2.819091     .0006543

   Female |    1.92e+07     13.76295     2.720712     .0006205

--------------------------------------------------------------

 

* Using fweight gives a dramatically larger N, but the same mean and essentially the same sd as using aweight (the mean does not depend on N, and the sd depends on N only through the N-1 divisor). N goes up by a factor of 2000 or so, and semean decreases by a factor of about sqrt(2000), or roughly 45.

 

. regress yrsed male [fweight=perwt_rounded] if age>=25 & age<=34

 

      Source |       SS       df       MS              Number of obs =37785945

-------------+------------------------------           F(  1,37785943) =52018.00

       Model |  398979.047     1  398979.047           Prob > F      =  0.0000

    Residual |   289818910 37785943  7.67001924          R-squared     =  0.0014

-------------+------------------------------           Adj R-squared =  0.0014

       Total |   290217889 37785944  7.68057796          Root MSE      =  2.7695

 

------------------------------------------------------------------------------

       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

        male |  -.2055446   .0009012  -228.07   0.000    -.2073109   -.2037782

       _cons |   13.76294   .0006317  2.2e+04   0.000     13.76171    13.76418

------------------------------------------------------------------------------

 

* Result of doing regression with fweights? The T statistic is inflated by a factor of about sqrt(2000), or roughly 45 (from 5.05 to 228.07). This is an easy mistake to make, but a bad one. We don't really have 37 million cases in our sample, we have 18 thousand. In this particular case, with this dataset and these weights, fweighted regression yields a misleading answer.
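
* The size of the inflation can be worked out from the two N's. This is a back-of-the-envelope Python sketch using numbers copied from the aweighted and fweighted regressions above; the variable names are made up for the illustration, and the sqrt(N) scaling is an approximation that ignores the small change in the residual mean square.

```python
import math

n_sample = 18538        # real number of cases (aweight regression)
n_fweight = 37785945    # "cases" Stata counts under fweight

# How many times larger fweight makes the apparent sample.
inflation = n_fweight / n_sample

# Standard errors shrink, and t statistics grow, by about sqrt(inflation).
se_factor = math.sqrt(inflation)

# t statistic for "male" from the aweighted regression: coef / std. err.
t_aweight = -0.2055446 / 0.0406899
t_predicted = t_aweight * se_factor

print(f"N inflation factor  = {inflation:.1f}")
print(f"sqrt of that factor = {se_factor:.2f}")
print(f"predicted fweight t = {t_predicted:.2f}")
```

* The predicted value lands very close to the -228.07 that Stata reports for the fweighted regression.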

 

. log close

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web

> pages\soc_meth_proj3\fall_2011_381_logs\class4.log

  log type:  text

 closed on:   6 Oct 2011, 15:23:45

--------------------------------------------------------------------------------