class5

---------------------------------------------------------------------------

name: <unnamed>

log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\soc_180B_win2013\class5.log

log type: text

opened on: 24 Jan 2013, 12:00:36

. use "C:\Users\Michael\Desktop\cps_mar_2000_new_unchanged.dta", clear

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 13.31212 .0312351 2.967666 13.25089 13.37335

Female | 9511 13.55657 .0292693 2.854472 13.49919 13.61394

---------+--------------------------------------------------------------------

combined | 18538 13.43753 .0213921 2.912627 13.3956 13.47946

---------+--------------------------------------------------------------------

diff | -.2444469 .0427623 -.3282649 -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

. table sex if age>24 & age<35, contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.31212 2.967666 .0312351

Female | 9,511 13.55657 2.854472 .0292693

--------------------------------------------------------------

* Reviewing the mean, sd, and standard error of yrsed by gender.

. display 2.967666/(sqrt(9027))

.03123513

* Remember that the standard error of the mean is simply sd/sqrt(n).
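
* A minimal sketch of the same check for the Female row, using only numbers copied from the table above. The last line is an illustrative assumption (a hypothetical sample four times as large, same sd) to show that quadrupling n halves the standard error.

```stata
* Female standard error: sd/sqrt(n), values taken from the table output
display 2.854472/sqrt(9511)

* Male standard error at the actual n, then at a hypothetical n four times larger
display 2.967666/sqrt(9027)
display 2.967666/sqrt(4*9027)
```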

. table sex if age>24 & age<35, contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.31212 2.967666 .0312351

Female | 9,511 13.55657 2.854472 .0292693

--------------------------------------------------------------

. table sex if age>24 & age<35 [aweight=perwt_rounded], contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 9,027 13.5574 2.819247 .029673

Female | 9,511 13.76295 2.720855 .0278992

--------------------------------------------------------------

* When we apply the weights, the reported sample size is unchanged because aweights (analytic weights) are rescaled to sum to the actual number of observations. The weighted data have a somewhat different mean and sd, and therefore a somewhat different standard error, because the weights put more emphasis on some observations than on others.

. table sex if age>24 & age<35 [fweight=perwt_rounded], contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex | Freq. mean(yrsed) sd(yrsed) sem(yrsed)

----------+---------------------------------------------------

Male | 1.86e+07 13.5574 2.819091 .0006543

Female | 1.92e+07 13.76295 2.720712 .0006205

--------------------------------------------------------------

* When we apply fweights, we are telling Stata that each observation really counts for about 2000 observations. This means that our sample size goes up dramatically (by a factor of about 2000 compared to the aweight version above), but the mean and sd are essentially the same. The standard error is reduced by a factor of sqrt(2000), or about 45. Note that the mean and sd are not functions of sample size, but the standard error is.
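
* A rough check of the shrinkage in the standard error, using only figures copied from the aweight and fweight tables above. The fweighted frequency is printed to three significant digits (1.86e+07), so the last line is only approximate.

```stata
* Ratio of aweighted to fweighted standard error: males, then females
display .029673/.0006543
display .0278992/.0006205

* Square root of the implied average weight for males (fweighted N over raw N)
display sqrt(1.86e+07/9027)
```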

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 13.31212 .0312351 2.967666 13.25089 13.37335

Female | 9511 13.55657 .0292693 2.854472 13.49919 13.61394

---------+--------------------------------------------------------------------

combined | 18538 13.43753 .0213921 2.912627 13.3956 13.47946

---------+--------------------------------------------------------------------

diff | -.2444469 .0427623 -.3282649 -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

* Back to our favorite t-test. How do we interpret the t-statistic of -5.7164? What probability do we attach to this statistic? The t-test reports a probability; the middle one above, Pr(|T| > |t|) = 0.0000, corresponds to a two-tailed test. But the output does not show enough digits to quantify the probability, so let's ask Stata to compute it for us.

. display normal(5.716)

.99999999

* That is the normal left-cumulative probability. We want the right tail, and then both tails, so:

. display 2*(1-normal(5.716))

1.091e-08

. display ttail(18536,5.7164)

5.524e-09

* ttail gives the right-tail cumulative probability, that is, the probability from 5.7164 to infinity, which is the relevant tail for us.

. display 2*ttail(18536,5.7164)

1.105e-08

* And that, about 1 in 100 million, is our two-tailed t probability of finding a statistic this large in absolute value by chance if the null hypothesis were true. So we reject the null hypothesis.

*Finding the key values of the normal distribution:

. display invnormal(1-.025)

1.959964

* Comparing to the t-distribution: as df increases, the t-distribution approaches the normal.

. display invttail(10, .025)

2.2281389

. display invttail(100, .025)

1.9839715

. display invttail(20000, .025)

1.9600826
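
* The same comparison can be sketched as a loop. The df value of 1000 is an extra illustrative point not run in the session above.

```stata
* Two-tailed 5% critical values of t for increasing degrees of freedom
foreach df of numlist 10 100 1000 20000 {
    display "df = `df'" _col(15) "t critical = " invttail(`df', .025)
}

* The limiting normal critical value for comparison
display "normal" _col(15) "z critical = " invnormal(1 - .025)
```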

. tabulate sex

Sex | Freq. Percent Cum.

------------+-----------------------------------

Male | 64,791 48.46 48.46

Female | 68,919 51.54 100.00

------------+-----------------------------------

Total | 133,710 100.00

. tabulate sex, nolab

Sex | Freq. Percent Cum.

------------+-----------------------------------

1 | 64,791 48.46 48.46

2 | 68,919 51.54 100.00

------------+-----------------------------------

Total | 133,710 100.00

* Now I am going to generate a dummy variable for male, to use in regression.

. gen byte male=0

. replace male=1 if sex==1

(64791 real changes made)

. tabulate sex male

| male

Sex | 0 1 | Total

-----------+----------------------+----------

Male | 0 64,791 | 64,791

Female | 68,919 0 | 68,919

-----------+----------------------+----------

Total | 68,919 64,791 | 133,710
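
* A possible one-step alternative to the gen/replace pair, sketched on a tiny made-up dataset (the five values of sex below are hypothetical, not the CPS file). Note that clear drops whatever data are in memory, so this is best tried in a separate session. The variable name male_strict is an assumption introduced here for illustration.

```stata
* Toy data for illustration only; clear removes any data currently in memory
clear
input sex
1
2
2
1
.
end

* One-step dummy: (sex == 1) evaluates to 1 when true and 0 otherwise,
* so a missing sex is coded 0, just as with gen male=0 followed by replace
generate byte male = (sex == 1)

* Variant that leaves the dummy missing when sex is missing
generate byte male_strict = (sex == 1) if !missing(sex)

tabulate sex male, missing
list
```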

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 13.31212 .0312351 2.967666 13.25089 13.37335

Female | 9511 13.55657 .0292693 2.854472 13.49919 13.61394

---------+--------------------------------------------------------------------

combined | 18538 13.43753 .0213921 2.912627 13.3956 13.47946

---------+--------------------------------------------------------------------

diff | -.2444469 .0427623 -.3282649 -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

. regress yrsed male if age>24 & age<35

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 32.68

Model | 276.742433 1 276.742433 Prob > F = 0.0000

Residual | 156979.922 18536 8.46892111 R-squared = 0.0018

-------------+------------------------------ Adj R-squared = 0.0017

Total | 157256.664 18537 8.48339343 Root MSE = 2.9101

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2444469 .0427623 -5.72 0.000 -.3282649 -.1606289

_cons | 13.55657 .0298401 454.31 0.000 13.49808 13.61506

------------------------------------------------------------------------------

*Note that the t-statistic produced by the (equal variance) t-test and the t-statistic produced by regression are the same. Regression is just a generalization of the t-test.
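
* A sketch of why the two agree, working only from the summary statistics printed in the t-test output. The scalar names are assumptions chosen for this illustration. The pooled variance computed here should be close to the MS Residual in the regression table, and t squared should be close to the regression F.

```stata
* Summary statistics copied from the ttest output
scalar n_m  = 9027
scalar sd_m = 2.967666
scalar n_f  = 9511
scalar sd_f = 2.854472
scalar d_mf = -.2444469

* Pooled variance under the equal-variance assumption
scalar s2_pool = ((n_m - 1)*sd_m^2 + (n_f - 1)*sd_f^2)/(n_m + n_f - 2)
display "pooled variance  = " s2_pool

* Standard error of the difference in means, then the t-statistic
scalar se_diff = sqrt(s2_pool)*sqrt(1/n_m + 1/n_f)
scalar t_stat  = d_mf/se_diff
display "se of difference = " se_diff
display "t                = " t_stat

* With a single predictor, the regression F equals t squared
scalar t_sq = t_stat^2
display "t squared        = " t_sq

* The intercept is the female mean; intercept plus coefficient is the male mean
scalar mean_m = 13.55657 + d_mf
display "male mean        = " mean_m
```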

. regress yrsed male if age>24 & age<35 [aweight= perwt_rounded]

(sum of wgt is 3.7786e+07)

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 25.52

Model | 195.741395 1 195.741395 Prob > F = 0.0000

Residual | 142186.809 18536 7.67084641 R-squared = 0.0014

-------------+------------------------------ Adj R-squared = 0.0013

Total | 142382.551 18537 7.6809921 Root MSE = 2.7696

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2055446 .0406899 -5.05 0.000 -.2853005 -.1257887

_cons | 13.76294 .0285199 482.57 0.000 13.70704 13.81885

------------------------------------------------------------------------------

* The aweighted regression yields a different coefficient and t-statistic, but they are of the same order of magnitude as the unweighted ones. Using aweights is one way of applying the weights while making sure that the standard errors reflect the actual sample size you have.

. regress yrsed male if age>24 & age<35 [fweight= perwt_rounded]

Source | SS df MS Number of obs = 37785945

-------------+------------------------------ F( 1, 37785943) = 52018.00

Model | 398979.047 1 398979.047 Prob > F = 0.0000

Residual | 289818910 37785943 7.67001924 R-squared = 0.0014

-------------+------------------------------ Adj R-squared = 0.0014

Total | 290217889 37785944 7.68057796 Root MSE = 2.7695

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2055446 .0009012 -228.07 0.000 -.2073109 -.2037782

_cons | 13.76294 .0006317 2.2e+04 0.000 13.76171 13.76418

------------------------------------------------------------------------------

* The fweighted regression increases the sample size by a factor of about 2000, and increases the t-statistic by a factor of sqrt(2000), or about 45.
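
* A quick check of that scaling, using only numbers copied from the aweighted and fweighted regression tables above. The three results should be close to one another, not identical, because the printed values are rounded.

```stata
* Square root of the ratio of fweighted N to actual N
display sqrt(37785945/18538)

* Ratio of the t-statistics for male (absolute values), fweight over aweight
display 228.07/5.05

* Ratio of the standard errors for male, aweight over fweight
display .0406899/.0009012
```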

. table occ1990 if occ1990==95| occ1990==125| occ1990==178, contents (freq mean inctot sd inctot)

----------------------------------------------------------------

Occupation, 1990 |

basis | Freq. mean(inctot) sd(inctot)

----------------------+-----------------------------------------

Registered nurses | 966 40787.1677 22941.43

Sociology instructors | 6 44363.33333 6497.989

Lawyers | 441 99242.58277 71860.66

----------------------------------------------------------------

. graph box inctot if occ1990==95| occ1990==125| occ1990==178, over( occ1990)

* A quick look at how to make box plots.

* one way to do ttests testing for differences in some variable between two occupations:

. ttest yrsed if occ1990==95| occ1990==125, by(occ1990)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Sociolog | 6 17 0 0 17 17

---------+--------------------------------------------------------------------

combined | 972 15.55658 .0514811 1.605022 15.45556 15.65761

---------+--------------------------------------------------------------------

diff | -1.452381 .6559623 -2.73965 -.1651122

------------------------------------------------------------------------------

diff = mean(Register) - mean(Sociolog) t = -2.2141

Ho: diff = 0 degrees of freedom = 970

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0135 Pr(|T| > |t|) = 0.0271 Pr(T > t) = 0.9865

* generate a dummy variable for nurses.

. gen nurses=0

. replace nurses=1 if occ1990==95

(966 real changes made)

. regress yrsed nurses if occ1990==95| occ1990==125

Source | SS df MS Number of obs = 972

-------------+------------------------------ F( 1, 970) = 4.90

Model | 12.5783363 1 12.5783363 Prob > F = 0.0271

Residual | 2488.80952 970 2.56578301 R-squared = 0.0050

-------------+------------------------------ Adj R-squared = 0.0040

Total | 2501.38786 971 2.5760946 Root MSE = 1.6018

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nurses | -1.452381 .6559623 -2.21 0.027 -2.73965 -.1651122

_cons | 17 .6539346 26.00 0.000 15.71671 18.28329

------------------------------------------------------------------------------
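
* A sketch of how to read the group means off this regression. Sociology instructors are the reference group, so _cons is their mean and _cons plus the nurses coefficient is the nurses' mean. The scalar names are assumptions for illustration; the group sizes 966 and 6 come from the table of occupations above.

```stata
* Coefficients copied from the regression output
scalar mean_soc   = 17
scalar b_nurses   = -1.452381

* Implied mean years of education for registered nurses
scalar mean_nurse = mean_soc + b_nurses
display "implied nurse mean    = " mean_nurse

* The size-weighted average of the two group means should be close to
* the combined mean reported by ttest (15.55658)
scalar mean_all = (966*mean_nurse + 6*mean_soc)/(966 + 6)
display "implied combined mean = " mean_all
```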

. regress yrsed male if age>24 & age<35

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 32.68

Model | 276.742433 1 276.742433 Prob > F = 0.0000

Residual | 156979.922 18536 8.46892111 R-squared = 0.0018

-------------+------------------------------ Adj R-squared = 0.0017

Total | 157256.664 18537 8.48339343 Root MSE = 2.9101

------------------------------------------------------------------------------

yrsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -.2444469 .0427623 -5.72 0.000 -.3282649 -.1606289

_cons | 13.55657 .0298401 454.31 0.000 13.49808 13.61506

------------------------------------------------------------------------------

* What if we changed the units of our variables? What if instead of years of education, we had months?

. gen monthsed=yrsed*12

(30484 missing values generated)

. regress monthsed male if age>24 & age<35

Source | SS df MS Number of obs = 18538

-------------+------------------------------ F( 1, 18536) = 32.68

Model | 39850.9104 1 39850.9104 Prob > F = 0.0000

Residual | 22605108.7 18536 1219.52464 R-squared = 0.0018

-------------+------------------------------ Adj R-squared = 0.0017

Total | 22644959.6 18537 1221.60865 Root MSE = 34.922

------------------------------------------------------------------------------

monthsed | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

male | -2.933363 .5131471 -5.72 0.000 -3.939178 -1.927547

_cons | 162.6788 .3580818 454.31 0.000 161.9769 163.3807

------------------------------------------------------------------------------

* Note that the change of scale affects our coefficient and standard error (which are in the units of the dependent variable), but the t-statistic is unchanged, because the t-statistic is unit free.
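
* The rescaling can be verified by hand from the years-of-education regression. This sketch multiplies the printed coefficient and standard error by 12; the scalar names are assumptions for illustration, and small discrepancies in the last digits reflect rounding in the printed output.

```stata
* Coefficient and standard error for male from the yrsed regression
scalar b_years  = -.2444469
scalar se_years = .0427623

* In months, both are 12 times larger
display "coef in months = " b_years*12
display "se in months   = " se_years*12

* The factor of 12 cancels in the ratio, so t is the same in either unit
scalar t_years  = b_years/se_years
scalar t_months = (b_years*12)/(se_years*12)
display "t (years)      = " t_years
display "t (months)     = " t_months
```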

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 9027 13.31212 .0312351 2.967666 13.25089 13.37335

Female | 9511 13.55657 .0292693 2.854472 13.49919 13.61394

---------+--------------------------------------------------------------------

combined | 18538 13.43753 .0213921 2.912627 13.3956 13.47946

---------+--------------------------------------------------------------------

diff | -.2444469 .0427623 -.3282649 -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = -5.7164

Ho: diff = 0 degrees of freedom = 18536

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

. log close

name: <unnamed>

log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\soc_180B_win2013\class5.log

log type: text

closed on: 24 Jan 2013, 15:54:55

------------------------------------------------------------------------------------------