---------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\fourth_class.log
log type: text
opened on: 4 Feb 2010, 14:42:34
. display normal(2)
.97724987
* In the standard Normal distribution, only about 2.3% of the probability lies more than 2 standard deviations above the mean of zero.
. display 1-normal(2)
.02275013
. display 1-normal(1.96)
.0249979
* We are usually concerned with the 5% significance level, which is arbitrary but standard. Since 2.5% of the distribution lies more than 1.96 standard deviations above the mean, and another 2.5% lies more than 1.96 below it, the critical value in Z-scores is about 1.96: a Normal variable lands 1.96 or more standard deviations away from zero (in either direction) about 5% of the time.
. display 2*(1-normal(1.96))
.04999579
. display invnormal(1-.025)
1.959964
* The function normal takes a Z score and returns the cumulative probability up to that score (the cumulative distribution function), whereas invnormal takes a cumulative probability and returns the corresponding Z score. Check Freedman’s tables. Also, use Stata help to look up normal and invnormal to remind yourself of how they work. Stata online help is useful!
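* A quick check, not run in class (output not shown): since invnormal is the inverse of normal, each round trip below should return its starting value, 1.96 and then .975, up to rounding.
. display invnormal(normal(1.96))
. display normal(invnormal(.975))
* By the symmetry of the Normal curve, the lower tail below -2 equals the upper tail above 2, so this should match 1-normal(2):
. display normal(-2)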
. display normal(5.716)
.99999999
. display 1-normal(5.716)
5.453e-09
. display 2*(1-normal(5.716))
1.091e-08
. * That number, about 1.1 times 10 to the minus 8, or roughly 1 in 100 million, is the chance that a normally distributed statistic would land at least 5.716 away from zero in either direction; 5.716 is what we got by comparing men's and women's years of education. Since this P value is so low, we can reject the null hypothesis that the two groups have equal mean education.
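. * A sketch, not run in class (output not shown): the same two-tailed P value can be taken from the T distribution instead of the Normal. ttail(df,t) is Stata's upper-tail probability for Student's t; the 18536 degrees of freedom and t=5.7164 are copied from the ttest output further down in this log. With this many degrees of freedom the answer should be of the same order of magnitude as the Normal-based value.
. display 2*ttail(18536, 5.7164)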
. display invnormal(1-.025)
1.959964
. * The critical 97.5% cumulative probability point for the Normal distribution is at 1.96 standard deviations above the mean. For the T distribution the critical value depends on the degrees of freedom.
. display invttail(16,.025)
2.1199053
. display invttail(100,.025)
1.9839715
. display invttail(1800,.025)
1.9612828
. * As the degrees of freedom increase (for this two-sample test, the combined sample size of the two groups minus 2), the T and Normal distributions come to be pretty much the same. Compare Freedman’s tables.
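. * A sketch, not run in class (output not shown): with the 18536 degrees of freedom from the ttest below, the T critical value should exceed the Normal one by only about .0001.
. display invttail(18536,.025)
. display invttail(18536,.025) - invnormal(1-.025)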
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta",
> clear
. gen mos_education= yrsed*12
(30484 missing values generated)
. ttest yrsed if age>24 & age<35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. ttest mos_education if age>24 & age<35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    159.7454    .3748215    35.61199    159.0107    160.4802
  Female |    9511    162.6788    .3512319    34.25366    161.9903    163.3673
---------+--------------------------------------------------------------------
combined |   18538    161.2504    .2567052    34.95152    160.7472    161.7536
---------+--------------------------------------------------------------------
    diff |           -2.933363    .5131471               -3.939178   -1.927547
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
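. * The T statistic is not affected by changes in scale (that is, by the change from yrsed to mos_education): the difference in means and its standard error are both multiplied by 12, so their ratio stays at -5.7164.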
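. * A sketch, not run in class (output not shown): the T statistic is the difference in means divided by its standard error. Using the diff rows of the two tables above, one in years and one in months, both lines should give about -5.716.
. display -.2444469/.0427623
. display -2.933363/.5131471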
. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace
file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved
. exit
---------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web p
> ages\soc_meth_proj3\2010_logs\fourth_class.log
log type: text
opened on: 4 Feb 2010, 15:18:51
. * That is where class ended, but I want to open this log back up to demonstrate a few additional things that I did not get to in class and that are relevant to HW2.
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", clear
. *load up the 2000 CPS data again.
. * One thing I mentioned is that there are two different t-tests: one which assumes that the two subsamples have equal variance (the equal variance t-test), and one which uses the actual variances of the two samples, be they similar or different (the unequal variance t-test).
. * In my Excel file, I use the unequal variance t-test.
. ttest yrsed if age>24 & age<35, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. * By default, Stata's ttest assumes equal variances, as above.
. ttest yrsed if age>24 & age<35, by(sex) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0428057                 -.32835   -.1605438
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7106
Ho: diff = 0                     Satterthwaite's degrees of freedom =  18383.6
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
. * The unequal variance t-test, which we invoked just by adding the word "unequal" after the comma, gives us exactly what we got in the Excel file. In this case the equal and unequal variance t-tests are very similar because, as you can see from the t-test table, the standard deviations of education for men and women are very similar (2.97 and 2.85).
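. * A sketch, not run in class (output not shown): the unequal variance standard error of the difference is the square root of the sum of the two squared group standard errors. Using the Std. Err. column for Male and Female above, the first line should come out near .0428 and the second near -5.71, in line with the table.
. display sqrt(.0312351^2 + .0292693^2)
. display -.2444469/sqrt(.0312351^2 + .0292693^2)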
. * Now let's look at a simple regression version of this.
. *First we generate a new dummy variable for gender.
. codebook sex
-------------------------------------------------------------------------------------------------
sex                                                                                           Sex
-------------------------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexlbl
                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/133710
            tabulation:  Freq.   Numeric  Label
                         64791         1  Male
                         68919         2  Female
. *OK, sex=1 for men and 2 for women. We are going to create a new variable that=1 for women and
> =0 otherwise.
. gen female=0
. replace female=1 if sex==2
(68919 real changes made)
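. * An alternative sketch, not run in class (output not shown): the same dummy can be built in one line from a logical expression, which Stata evaluates to 1 when true and 0 when false. The name female_alt is hypothetical, chosen only so it does not clash with female. The "if !missing(sex)" guard makes no difference here because sex has no missing values, but it keeps missing cases from being coded 0 in other data. The tabulate should show the two variables agreeing on every row, after which the extra variable is dropped.
. gen byte female_alt = (sex==2) if !missing(sex)
. tabulate female female_alt
. drop female_alt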
. regress yrsed female if age>24 & age<35
      Source |       SS       df       MS              Number of obs =   18538
-------------+------------------------------           F(  1, 18536) =   32.68
       Model |  276.742433     1  276.742433           Prob > F      =  0.0000
    Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018
-------------+------------------------------           Adj R-squared =  0.0017
       Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101
------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .2444469   .0427623     5.72   0.000     .1606289    .3282649
       _cons |   13.31212   .0306297   434.62   0.000     13.25208    13.37216
------------------------------------------------------------------------------
. * You can see here that the coefficient for female is 0.244 years; that is the additional educational attainment that women have, on average, over men. The constant (13.31) corresponds to male education, i.e. the educational attainment when the variable female=0. The t-statistic (5.72) matches the equal variance t-test above, with the sign reversed because the t-test subtracted women's mean from men's.
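. * A sketch, not run in class (output not shown): right after regress, Stata keeps the coefficients in _b[]. Adding the constant to the female coefficient should recover the women's mean from the t-test table, about 13.557.
. display _b[_cons] + _b[female]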
. *This is the simple OLS regression, using gender to predict yrsed.
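. * The log does not show where the variable male used in the next command was created; it may already have been in the saved dataset. A sketch of one way to build it if it is missing (an assumption, not a recorded step from class):
. gen male = 1 - female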
. regress yrsed male if age>24 & age<35
      Source |       SS       df       MS              Number of obs =   18538
-------------+------------------------------           F(  1, 18536) =   32.68
       Model |  276.742433     1  276.742433           Prob > F      =  0.0000
    Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018
-------------+------------------------------           Adj R-squared =  0.0017
       Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101
------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289
       _cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506
------------------------------------------------------------------------------
. * And if we use a dummy variable that=1 for men and =0 for women, we get the same coefficient and T-statistic but with signs reversed, and here the constant term (13.56) is women's education.
. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace
file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved
. exit, clear