name: <unnamed>
log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\fall_2013_38
> 1_logs\class6.log
log type: text
opened on: 10 Oct 2013, 13:43:13
. use "C:\Users\Michael\Desktop\cps_mar_2000_new_unchanged.dta", clear
* One leftover point from HW1: Q4 called for a direct comparison between the incomes of veterans and non-veterans, and most of the student homework I read skipped it. But the simple comparison is important, and revealing.
. tabulate vetlast
  Veteran's most recent |
      period of service |      Freq.     Percent        Cum.
------------------------+-----------------------------------
                    NIU |     30,904       23.11       23.11
             No service |     91,149       68.17       91.28
           World War II |      2,428        1.82       93.10
             Korean War |      1,716        1.28       94.38
            Vietnam Era |      3,683        2.75       97.14
          Other service |      3,830        2.86      100.00
------------------------+-----------------------------------
                  Total |    133,710      100.00
. gen byte veteran=0 if vetlast~=0
(30904 missing values generated)
. replace veteran=1 if vetlast>1
(11657 real changes made)
. tabulate vetlast veteran
Veteran's most recent |        veteran
    period of service |         0          1 |     Total
----------------------+----------------------+----------
           No service |    91,149          0 |    91,149
         World War II |         0      2,428 |     2,428
           Korean War |         0      1,716 |     1,716
          Vietnam Era |         0      3,683 |     3,683
        Other service |         0      3,830 |     3,830
----------------------+----------------------+----------
                Total |    91,149     11,657 |   102,806
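* A one-line alternative for building the same dummy, offered as a sketch rather than as the method used in class. It assumes the CPS extract loaded at the top of this log is in memory, and veteran2 is a hypothetical name. The !missing() guard is an extra precaution: Stata treats missing as larger than any number, so vetlast>1 alone would be true for a missing vetlast.

```stata
* veteran2 is a hypothetical name, chosen so the original veteran is untouched
* the logical expression (vetlast>1) evaluates to 1 when true and 0 when false
gen byte veteran2 = (vetlast>1) if vetlast~=0 & !missing(vetlast)

* stops with an error if the two versions disagree on any observation
assert veteran2==veteran
```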
. table veteran [aweight= perwt_rounded] , contents (mean inctot)
------------------------
veteran | mean(inctot)
----------+-------------
0 | 25052.93274
1 | 38866.1566
------------------------
* Note: veterans have much higher average income than non-veterans (about 38,866 versus 25,053). Why? Largely because veterans are more likely to be male and more likely to be older, at ages when earnings peak.
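* One way to check that explanation, sketched with variables already used in this log. It assumes the CPS extract is in memory and that sex==1 codes men, which this log does not state explicitly.

```stata
* share of each sex among veterans and non-veterans (row percentages)
tabulate veteran sex, row

* weighted mean age and mean income by veteran status
table veteran [aweight=perwt_rounded], contents(mean age mean inctot)

* weighted mean income by veteran status within each sex
table veteran sex [aweight=perwt_rounded], contents(mean inctot)
```

* If the income gap shrinks once the comparison is made within sex, the raw gap is partly a composition effect.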
. graph box age if occ1990==178| occ1990==95 | occ1990==125, over (occ1990)
. graph hbox age if occ1990==178| occ1990==95 | occ1990==125, over (occ1990)
* Two orientations of the box plot. Look up graph box in the Stata manual for an explanation of how the whiskers and outliers are calculated.
* Now on to a brief discussion of dummy variables, with metro as the predictor. Note that this is covered in more detail in my Excel sheet, “understanding dummy variables.”
. codebook metro
-----------------------------------------------------------------------
metro Metropolitan central city status
-----------------------------------------------------------------------
type: numeric (byte)
label: metrolbl
range: [0,4] units: 1
unique values: 5 missing .: 0/133710
tabulation: Freq. Numeric Label
340 0 Not identifiable
29658 1 Not in metro area
32481 2 Central city
51468 3 Outside central city
19763 4 Central city status unknown
. table metro if age>29 & age<65 & sex==1, contents(freq mean incwage)
----------------------------------------------------------
Metropolitan central city |
status | Freq. mean(incwage)
----------------------------+-----------------------------
Not identifiable | 94 31743.04255
Not in metro area | 6,628 27189.6465
Central city | 6,727 34445.35841
Outside central city | 11,639 43203.0348
Central city status unknown | 4,247 35557.95997
----------------------------------------------------------
. regress incwage metro if age>29 & age<65
Source | SS df MS Number of obs = 60477
-------------+------------------------------ F( 1, 60475) = 464.31
Model | 5.0002e+11 1 5.0002e+11 Prob > F = 0.0000
Residual | 6.5126e+13 60475 1.0769e+09 R-squared = 0.0076
-------------+------------------------------ Adj R-squared = 0.0076
Total | 6.5626e+13 60476 1.0852e+09 Root MSE = 32816
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro | 2870.889 133.2332 21.55 0.000 2609.752 3132.027
_cons | 20308.34 353.9993 57.37 0.000 19614.5 21002.18
------------------------------------------------------------------------------
* Please don’t ever do this: don’t treat a categorical variable like a continuous variable and just plug it into the regression. Stata will let you, but it is wrong, wrong, wrong. One way to think about how wrong it is: what are the units of metro? If metro doesn’t have units, you need to go the dummy variable route. (Note also that this run leaves out the sex==1 restriction used in the table above, so it pools men and women.)
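* To see what the linear specification assumes, the fitted values can be computed from the two coefficients in the regression above. A minimal sketch that uses only numbers printed in that output:

```stata
* fitted incwage at each metro code under the (incorrect) linear specification
forvalues k = 1/4 {
    display "metro = `k':  fitted incwage = " %9.2f (20308.34 + 2870.889*`k')
}
```

* Every one-step change in the metro code is forced to be worth the same 2870.889, whether the step is from "Not in metro area" to "Central city" or from "Outside central city" to "Central city status unknown". Nothing about the categories justifies that spacing.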
* First, using the old syntax of xi: and i.variable to generate the dummy variables.
. xi: regress incwage i.metro if age>29 & age<65 & sex==1 & metro~=0
i.metro _Imetro_0-4 (naturally coded; _Imetro_0 omitted)
note: _Imetro_1 omitted because of collinearity
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_1 | 0 (omitted)
_Imetro_2 | 7255.712 668.0533 10.86 0.000 5946.297 8565.127
_Imetro_3 | 16013.39 593.9852 26.96 0.000 14849.15 17177.63
_Imetro_4 | 8368.313 758.7058 11.03 0.000 6881.216 9855.411
_cons | 27189.65 474.1327 57.35 0.000 26260.33 28118.97
------------------------------------------------------------------------------
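* Note that the coefficients correspond to the actual differences in mean values between the categories. Here everything is compared to category 1 (not in metro area): I left category zero (not identifiable) out of the analysis, so xi dropped _Imetro_1 as the omitted category, and _cons is the mean for the not-in-metro-area group.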
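* The correspondence can be checked by hand from the table of means printed before the regressions. Subtracting the "Not in metro area" mean (27189.6465, which is also _cons) from each of the other means should reproduce the coefficients on _Imetro_2, _Imetro_3 and _Imetro_4 up to rounding. A sketch using only numbers from this log:

```stata
* Central city minus Not in metro area
display %9.3f (34445.35841 - 27189.6465)

* Outside central city minus Not in metro area
display %9.3f (43203.0348 - 27189.6465)

* Central city status unknown minus Not in metro area
display %9.3f (35557.95997 - 27189.6465)
```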
. table metro, contents (mean _Imetro_1 mean _Imetro_2 mean _Imetro_3 mean _Imetro_4)
-------------------------------------------------------------------------------------
Metropolitan central city |
status | __000002 __000003 __000004 __000005
----------------------------+--------------------------------------------------------
Not identifiable | 0 0 0 0
Not in metro area | 1 0 0 0
Central city | 0 1 0 0
Outside central city | 0 0 1 0
Central city status unknown | 0 0 0 1
-------------------------------------------------------------------------------------
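* What the dummy variables actually look like. The column headings __000002 through __000005 are temporary names that Stata assigned to mean(_Imetro_1) through mean(_Imetro_4), in the order listed in contents().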
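* Another way to build the same indicators without xi is the generate() option of tabulate. This is a sketch, assuming the CPS extract is in memory; the stub metro_d is a hypothetical name. Watch the numbering: the new variables are numbered by position, not by the value of metro.

```stata
* creates metro_d1 ... metro_d5, one per category of metro in sorted order;
* because metro starts at 0, metro_d1 flags metro==0 and metro_d2 flags metro==1
tabulate metro, generate(metro_d)

* leaving out metro_d2 makes "Not in metro area" the comparison category
regress incwage metro_d3 metro_d4 metro_d5 if age>29 & age<65 & sex==1 & metro~=0
```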
. table metro if age>29 & age<65 & sex==1, contents(freq mean incwage)
----------------------------------------------------------
Metropolitan central city |
status | Freq. mean(incwage)
----------------------------+-----------------------------
Not identifiable | 94 31743.04255
Not in metro area | 6,628 27189.6465
Central city | 6,727 34445.35841
Outside central city | 11,639 43203.0348
Central city status unknown | 4,247 35557.95997
----------------------------------------------------------
. regress incwage ib2.metro if age>29 & age<65 & sex==1 & metro~=0
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
metro |
Not in metro area | -7255.712 668.0533 -10.86 0.000 -8565.127 -5946.297
Outside central.. | 8757.676 591.1938 14.81 0.000 7598.91 9916.443
Central city st.. | 1112.602 756.5223 1.47 0.141 -370.2164 2595.419
|
_cons | 34445.36 470.6309 73.19 0.000 33522.9 35367.82
------------------------------------------------------------------------------------
* First, compared to central city (ib2 means the base category is metro==2).
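* Other base categories can be requested with the same factor-variable prefix. A sketch of a few variants on the same sample, assuming the CPS extract is in memory; none of these were run in class:

```stata
* base = 3 (Outside central city)
regress incwage ib3.metro if age>29 & age<65 & sex==1 & metro~=0

* base = the largest value of metro (here 4, Central city status unknown)
regress incwage ib(last).metro if age>29 & age<65 & sex==1 & metro~=0

* base = the most frequent category (here Outside central city)
regress incwage ib(freq).metro if age>29 & age<65 & sex==1 & metro~=0
```

* The fit statistics (R-squared, F, Root MSE) do not change with the base; only the coefficients and _cons are re-expressed.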
. regress incwage i.metro if age>29 & age<65 & sex==1 & metro~=0
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
metro |
Central city | 7255.712 668.0533 10.86 0.000 5946.297 8565.127
Outside central.. | 16013.39 593.9852 26.96 0.000 14849.15 17177.63
Central city st.. | 8368.313 758.7058 11.03 0.000 6881.216 9855.411
|
_cons | 27189.65 474.1327 57.35 0.000 26260.33 28118.97
------------------------------------------------------------------------------------
* Next, compared to rural (not in metro area, the default base). The two regressions above use different comparison categories for metro, so the coefficients all differ, but the model is the same and the same contrasts can be recovered:
. lincom 2.metro-3.metro
( 1) 2.metro - 3.metro = 0
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -8757.676 591.1938 -14.81 0.000 -9916.443 -7598.91
------------------------------------------------------------------------------
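* The urban-suburban contrast (central city minus outside central city): the same magnitude as the "Outside central city" coefficient in the ib2 regression, with the opposite sign.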
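* The lincom result is just the difference between two coefficients from the i.metro regression, and other contrasts can be requested the same way. A sketch, assuming the i.metro regression above is still the active estimation result and Stata 12 or later for pwcompare:

```stata
* the contrast by hand: Central city coefficient minus Outside central city coefficient
display %9.3f (7255.712 - 16013.39)

* another contrast: status unknown minus outside central city
lincom 4.metro - 3.metro

* all pairwise differences between metro categories, with tests
pwcompare metro, effects
```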
* Generating the 3 occupational dummy variables by hand, which is highly recommended.
. gen byte nurses=0
. replace nurses=1 if occ1990==95
(966 real changes made)
. gen byte lawyers=0
. replace lawyers=1 if occ1990==178
(441 real changes made)
. gen byte sociologists=0
. replace sociologists=1 if occ1990==125
(6 real changes made)
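* The same three dummies can each be built in one line, since a logical expression in Stata evaluates to 1 or 0. A sketch, assuming the CPS extract is in memory; the names ending in 2 are hypothetical, chosen so the variables above are not overwritten.

```stata
* each expression is 1 where the condition holds and 0 everywhere else
gen byte nurses2       = (occ1990==95)
gen byte lawyers2      = (occ1990==178)
gen byte sociologists2 = (occ1990==125)

* stops with an error if any one-line version differs from the by-hand version
assert nurses2==nurses & lawyers2==lawyers & sociologists2==sociologists
```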
. table occ1990 if occ1990==178| occ1990==95 | occ1990==125, contents (freq mean inctot)
--------------------------------------------------
Occupation, 1990 |
basis | Freq. mean(inctot)
----------------------+---------------------------
Registered nurses | 966 40787.1677
Sociology instructors | 6 44363.33333
Lawyers | 441 99242.58277
--------------------------------------------------
. regress inctot nurses if occ1990==178| occ1990==95
Source | SS df MS Number of obs = 1407
-------------+------------------------------ F( 1, 1405) = 522.88
Model | 1.0346e+12 1 1.0346e+12 Prob > F = 0.0000
Residual | 2.7800e+12 1405 1.9787e+09 R-squared = 0.2712
-------------+------------------------------ Adj R-squared = 0.2707
Total | 3.8146e+12 1406 2.7131e+09 Root MSE = 44482
------------------------------------------------------------------------------
inctot | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nurses | -58455.42 2556.381 -22.87 0.000 -63470.15 -53440.68
_cons | 99242.58 2118.201 46.85 0.000 95087.41 103397.8
------------------------------------------------------------------------------
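* Nurses compared to lawyers: _cons is the mean for lawyers, and the nurses coefficient is the nurses' mean minus the lawyers' mean.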
. regress inctot lawyers if occ1990==178| occ1990==95
Source | SS df MS Number of obs = 1407
-------------+------------------------------ F( 1, 1405) = 522.88
Model | 1.0346e+12 1 1.0346e+12 Prob > F = 0.0000
Residual | 2.7800e+12 1405 1.9787e+09 R-squared = 0.2712
-------------+------------------------------ Adj R-squared = 0.2707
Total | 3.8146e+12 1406 2.7131e+09 Root MSE = 44482
------------------------------------------------------------------------------
inctot | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lawyers | 58455.42 2556.381 22.87 0.000 53440.68 63470.15
_cons | 40787.17 1431.192 28.50 0.000 37979.66 43594.67
------------------------------------------------------------------------------
*lawyers compared to nurses.
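* The two regressions above are the same model with the reference group switched, so the coefficient only changes sign and _cons moves from one group mean to the other. A regression on a single dummy is also equivalent to a two-sample t test with equal variances. A sketch using numbers from the table above and the nurses dummy created earlier in this log:

```stata
* difference in means taken from the table: lawyers minus nurses
display %9.2f (99242.58277 - 40787.1677)

* two-sample t test (equal variances) on the same restricted sample;
* the t statistic should have the same magnitude as in the regressions
ttest inctot if occ1990==178 | occ1990==95, by(nurses)
```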
* Without restricting the sample, we would get nurses compared to everyone else, which is not what we want in this case.
. regress inctot nurses
Source | SS df MS Number of obs = 103226
-------------+------------------------------ F( 1,103224) = 207.52
Model | 2.1289e+11 1 2.1289e+11 Prob > F = 0.0000
Residual | 1.0590e+14 103224 1.0259e+09 R-squared = 0.0020
-------------+------------------------------ Adj R-squared = 0.0020
Total | 1.0611e+14 103225 1.0279e+09 Root MSE = 32029
------------------------------------------------------------------------------
inctot | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nurses | 14915.35 1035.387 14.41 0.000 12886 16944.69
_cons | 25871.82 100.1605 258.30 0.000 25675.51 26068.13
------------------------------------------------------------------------------
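* In this unrestricted regression _cons is the mean income of everyone who is not a nurse, so adding the nurses coefficient recovers the nurses' own mean from the earlier table (40787.1677). A one-line check using only numbers printed above:

```stata
* _cons (non-nurses' mean) plus the nurses coefficient
display %9.2f (25871.82 + 14915.35)
```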
. log close
name: <unnamed>
log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\fall_2013_381_
> logs\class6.log
log type: text
closed on: 10 Oct 2013, 15:51:04
-------------------------------------------------------------------------------------