--------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\fall_2012_381_l
> ogs\class6.log
log type: text
opened on: 11 Oct 2012, 12:09:02
. use "C:\Users\Michael\Desktop\cps_mar_2000_new_unchanged.dta", clear
. *class starts here...
*A couple of examples of box plots:
. graph box age if occ1990==178| occ1990==95| occ1990==125, over(occ1990)
. graph hbox age if occ1990==178| occ1990==95| occ1990==125, over(occ1990)
. codebook metro
------------------------------------------------------------------------------------
metro                                               Metropolitan central city status
------------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  metrolbl
                 range:  [0,4]                        units:  1
         unique values:  5                        missing .:  0/133710
            tabulation:  Freq.   Numeric  Label
                           340         0  Not identifiable
                         29658         1  Not in metro area
                         32481         2  Central city
                         51468         3  Outside central city
                         19763         4  Central city status unknown
. table metro if age>29 & age<65 & sex==1, contents( freq mean incwage)
----------------------------------------------------------
Metropolitan central city |
status | Freq. mean(incwage)
----------------------------+-----------------------------
Not identifiable | 94 31743.04255
Not in metro area | 6,628 27189.6465
Central city | 6,727 34445.35841
Outside central city | 11,639 43203.0348
Central city status unknown | 4,247 35557.95997
----------------------------------------------------------
* OK, we have 5 categories of metro. The first (“not identifiable”) is not really useful, so we will discard it in the analyses below.
* When dealing with categorical predictors, the one thing you never want to do is treat them as continuous predictors: the numeric codes are labels, not quantities.
. regress incwage metro if age>29 & age<65
Source | SS df MS Number of obs = 60477
-------------+------------------------------ F( 1, 60475) = 464.31
Model | 5.0002e+11 1 5.0002e+11 Prob > F = 0.0000
Residual | 6.5126e+13 60475 1.0769e+09 R-squared = 0.0076
-------------+------------------------------ Adj R-squared = 0.0076
Total | 6.5626e+13 60476 1.0852e+09 Root MSE = 32816
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro | 2870.889 133.2332 21.55 0.000 2609.752 3132.027
_cons | 20308.34 353.9993 57.37 0.000 19614.5 21002.18
------------------------------------------------------------------------------
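* The regression above is wrong in several ways. It treats the codes 0-4 as if they were a numeric scale, so it forces each one-step change in metro to shift predicted incwage by the same 2870.889, and it keeps the “not identifiable” category (metro==0) that we meant to discard.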
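* A quick sketch of what the linear fit above implies, using only the two coefficients printed in its output. The loop is an illustration added here, not something run in class: it shows that the model moves predicted incwage by the same amount for every one-step change in the metro code.

```stata
* Predicted incwage at each metro code, from _cons and the metro slope above
forvalues k = 0/4 {
    display "metro = `k'   predicted incwage = " %9.2f (20308.34 + 2870.889*`k')
}
```

* Each step adds 2870.889, even though nothing about the codes says that, for example, code 3 (Outside central city) is one unit more of anything than code 2 (Central city).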
* On the subject of how to use the syntax to make dummy variables, see the “understanding dummy vars” page of my class Excel file.
. xi: regress incwage i.metro if age>29 & age<65 & sex==1 &metro~=0
i.metro _Imetro_0-4 (naturally coded; _Imetro_0 omitted)
note: _Imetro_1 omitted because of collinearity
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_1 | (omitted)
_Imetro_2 | 7255.712 668.0533 10.86 0.000 5946.297 8565.127
_Imetro_3 | 16013.39 593.9852 26.96 0.000 14849.15 17177.63
_Imetro_4 | 8368.313 758.7058 11.03 0.000 6881.216 9855.411
_cons | 27189.65 474.1327 57.35 0.000 26260.33 28118.97
------------------------------------------------------------------------------
* Note that the constant (27189.65) is the actual mean of incwage for the omitted category, men aged 30-64 who are not in a metro area (see the table above), and each of the other coefficients is that area's mean minus the non-metro mean.
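* A small check of the note above, using the group means from the table command earlier in this log. This is just display arithmetic added as an illustration; differences in the last digit are rounding.

```stata
* Central city mean minus non-metro mean; compare with _Imetro_2 = 7255.712
display 34445.35841 - 27189.6465

* Outside central city mean minus non-metro mean; compare with _Imetro_3 = 16013.39
display 43203.0348 - 27189.6465

* Status unknown mean minus non-metro mean; compare with _Imetro_4 = 8368.313
display 35557.95997 - 27189.6465
```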
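* What do the dummy variables actually look like? They are 0/1 indicators, one per category of metro. (In the table below, the column headers __000002 through __000005 appear to be Stata's temporary names for the means of _Imetro_1 through _Imetro_4, in that order.)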
. table metro, contents(mean _Imetro_1 mean _Imetro_2 mean _Imetro_3 mean _Imetro_4)
------------------------------------------------------------------------------------
Metropolitan central city |
status | __000002 __000003 __000004 __000005
----------------------------+-------------------------------------------------------
Not identifiable | 0 0 0 0
Not in metro area | 1 0 0 0
Central city | 0 1 0 0
Outside central city | 0 0 1 0
Central city status unknown | 0 0 0 1
------------------------------------------------------------------------------------
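* If you want indicator variables like these without the xi: prefix, one common route is tabulate with the generate() option. This is a minimal sketch, not something run in class; the prefix metro_d is a made-up name for the illustration.

```stata
* Creates one 0/1 variable per category of metro: metro_d1, metro_d2, ..., metro_d5
tabulate metro, generate(metro_d)

* Cross-check one of the new indicators against the original variable
tabulate metro metro_d2
```

* The new variables are numbered by category order, not by code, so metro_d1 is the indicator for metro==0 and metro_d2 is the indicator for metro==1.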
* Using the ib#.variable_name syntax (factor-variable notation, which does not need the xi: prefix), it is easy to change the comparison category.
. regress incwage ib2.metro if age>29 & age<65 & sex==1 &metro~=0
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro |
1 | -7255.712 668.0533 -10.86 0.000 -8565.127 -5946.297
3 | 8757.676 591.1938 14.81 0.000 7598.91 9916.443
4 | 1112.602 756.5223 1.47 0.141 -370.2164 2595.419
|
_cons | 34445.36 470.6309 73.19 0.000 33522.9 35367.82
------------------------------------------------------------------------------
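* Besides ib#., Stata's factor-variable notation has a few other ways to pick the comparison category. A sketch of two variants follows; these were not run in class, so no output is shown.

```stata
* Use the highest value of metro as the comparison category
regress incwage ib(last).metro if age>29 & age<65 & sex==1 & metro~=0

* Use the most frequent category as the comparison category
regress incwage ib(freq).metro if age>29 & age<65 & sex==1 & metro~=0
```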
. regress incwage ib1.metro if age>29 & age<65 & sex==1 &metro~=0
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro |
2 | 7255.712 668.0533 10.86 0.000 5946.297 8565.127
3 | 16013.39 593.9852 26.96 0.000 14849.15 17177.63
4 | 8368.313 758.7058 11.03 0.000 6881.216 9855.411
|
_cons | 27189.65 474.1327 57.35 0.000 26260.33 28118.97
------------------------------------------------------------------------------
* But notice: changing the comparison category changes all the coefficients, but the regression goodness of fit is the same, and each specific comparison, when recovered, is exactly the same. The models are identical, just expressed differently.
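* To see that each comparison can be recovered, here is a small arithmetic sketch using the printed ib1.metro coefficients. The results should agree with the ib2.metro coefficients (8757.676 and 1112.602) up to rounding of the printed values.

```stata
* Outside central city vs. Central city, from the ib1.metro coefficients
display 16013.39 - 7255.712

* Central city status unknown vs. Central city, from the ib1.metro coefficients
display 8368.313 - 7255.712
```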
. lincom 2.metro-3.metro
( 1) 2.metro - 3.metro = 0
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -8757.676 591.1938 -14.81 0.000 -9916.443 -7598.91
------------------------------------------------------------------------------
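* lincom works for any pair of categories, not just this one. A sketch, assuming the ib1.metro regression is still the most recent estimation; these commands were not run in class, so no output is shown.

```stata
* Central city status unknown minus Central city
lincom 4.metro - 2.metro

* The same comparison as a test of equality of the two coefficients
test 4.metro = 2.metro
```

* The lincom estimate should reproduce the row for category 4 in the ib2.metro regression, since that model uses Central city as its comparison category.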
* my favorite dummy variable syntax is from the free add-on, desmat.
* try ssc install desmat, replace
. desmat: regress incwage metro=ind(3) if age>29 & age<65 & sex==1 & metro~=0,desrep (zval ci)
------------------------------------------------------------------------------------
Linear regression
------------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 29241
F statistic: 252.703
Model degrees of freedom: 3
Residual degrees of freedom: 29237
R-squared: 0.025
Adjusted R-squared: 0.025
Root MSE 38600.339
Prob: 0.000
------------------------------------------------------------------------------------
nr Effect Coeff s.e. t lo 95% hi 95%
------------------------------------------------------------------------------------
metro
1 Not identifiable 0.000 . . . .
2 Not in metro area -7255.712** 668.053 -10.861 -8565.127 -5946.297
3 Outside central city 8757.676** 591.194 14.814 7598.910 9916.443
4 ntral city status unknown 1112.602 756.522 1.471 -370.216 2595.419
5 _cons 34445.358** 470.631 73.190 33522.901 35367.816
------------------------------------------------------------------------------------
* p < .05
** p < .01
* Note: these coefficients match the ib2.metro regression above, so the comparison category here is Central city, which is the third category of metro (hence ind(3)). The 0.000 row labeled Not identifiable appears to be a labeling quirk of the output rather than a coefficient for metro==0, which the if condition excludes.
* For homework 2, it will be easiest to make the dummy variables by hand, in part because occupation has several hundred categories besides the 3 we are interested in.
. gen byte nurses=0
. replace nurses=1 if occ1990==95
(966 real changes made)
. gen byte lawyers=0
. replace lawyers=1 if occ1990==178
(441 real changes made)
. gen byte sociologists=0
. replace sociologists=1 if occ1990==125
(6 real changes made)
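* The gen/replace pairs above can also be written in one line each, because a logical expression in Stata evaluates to 1 when true and 0 when false. A minimal sketch; nurses_alt is a made-up name so that the variable created above is not overwritten.

```stata
* 1 if the occupation code is 95 (registered nurses), 0 otherwise
gen byte nurses_alt = (occ1990==95)

* assert is silent when the two versions agree on every observation
assert nurses_alt == nurses
```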
. table occ1990 if occ1990==178| occ1990==95| occ1990==125, contents (freq mean inctot)
--------------------------------------------------
Occupation, 1990 |
basis | Freq. mean(inctot)
----------------------+---------------------------
Registered nurses | 966 40787.1677
Sociology instructors | 6 44363.33333
Lawyers | 441 99242.58277
--------------------------------------------------
. regress inctot lawyers if occ1990==178|occ1990==95
Source | SS df MS Number of obs = 1407
-------------+------------------------------ F( 1, 1405) = 522.88
Model | 1.0346e+12 1 1.0346e+12 Prob > F = 0.0000
Residual | 2.7800e+12 1405 1.9787e+09 R-squared = 0.2712
-------------+------------------------------ Adj R-squared = 0.2707
Total | 3.8146e+12 1406 2.7131e+09 Root MSE = 44482
------------------------------------------------------------------------------
inctot | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lawyers | 58455.42 2556.381 22.87 0.000 53440.68 63470.15
_cons | 40787.17 1431.192 28.50 0.000 37979.66 43594.67
------------------------------------------------------------------------------
. regress inctot nurses if occ1990==178|occ1990==95
Source | SS df MS Number of obs = 1407
-------------+------------------------------ F( 1, 1405) = 522.88
Model | 1.0346e+12 1 1.0346e+12 Prob > F = 0.0000
Residual | 2.7800e+12 1405 1.9787e+09 R-squared = 0.2712
-------------+------------------------------ Adj R-squared = 0.2707
Total | 3.8146e+12 1406 2.7131e+09 Root MSE = 44482
------------------------------------------------------------------------------
inctot | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nurses | -58455.42 2556.381 -22.87 0.000 -63470.15 -53440.68
_cons | 99242.58 2118.201 46.85 0.000 95087.41 103397.8
------------------------------------------------------------------------------
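* Notice how the two regressions above are the same model, but with the comparison categories reversed: the coefficient only changes sign, and the constant switches from the nurses' mean to the lawyers' mean.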
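* A short arithmetic sketch of how the two sets of output connect, using only the printed coefficients. The results should agree with the other regression's constant up to rounding.

```stata
* Nurses' mean plus the lawyers coefficient; compare with _cons = 99242.58
display 40787.17 + 58455.42

* Lawyers' mean plus the nurses coefficient; compare with _cons = 40787.17
display 99242.58 + (-58455.42)
```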
. log close
name: <unnamed>
log: C:\Users\Michael\Documents\newer web pages\soc_meth_proj3\fall_2012_381
> _logs\class6.log
log type: text
closed on: 11 Oct 2012, 15:51:46
------------------------------------------------------------------------------------