-------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fal
> l_2010_s381_logs\class6.log
log type: text
opened on: 7 Oct 2010, 14:08:36
. tabulate metro
Metropolitan central city |
status | Freq. Percent Cum.
----------------------------+-----------------------------------
Not identifiable | 340 0.25 0.25
Not in metro area | 29,658 22.18 22.44
Central city | 32,481 24.29 46.73
Outside central city | 51,468 38.49 85.22
Central city status unknown | 19,763 14.78 100.00
----------------------------+-----------------------------------
Total | 133,710 100.00
. tabulate metro, nolabel
Metropolita |
n central |
city status | Freq. Percent Cum.
------------+-----------------------------------
0 | 340 0.25 0.25
1 | 29,658 22.18 22.44
2 | 32,481 24.29 46.73
3 | 51,468 38.49 85.22
4 | 19,763 14.78 100.00
------------+-----------------------------------
Total | 133,710 100.00
. codebook metro
-------------------------------------------------------------------------------
metro Metropolitan central city status
-------------------------------------------------------------------------------
type: numeric (byte)
label: metrolbl
range: [0,4] units: 1
unique values: 5 missing .: 0/133710
tabulation: Freq. Numeric Label
340 0 Not identifiable
29658 1 Not in metro area
32481 2 Central city
51468 3 Outside central city
19763 4 Central city status unknown
* codebook, or tabulate followed by tabulate, nolabel, are two ways of figuring out which numerical codes correspond to which categories.
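* A third way, sketched here on the assumption that the class dataset is still in memory, is to list the value label itself. metrolbl is the label name that codebook reports for metro.

```stata
* Print every numeric code in the value label next to its text
label list metrolbl
```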
. table metro if age>29 & age<65 & sex==1, contents(freq mean incwage)
----------------------------------------------------------
Metropolitan central city |
status | Freq. mean(incwage)
----------------------------+-----------------------------
Not identifiable | 94 31743.04255
Not in metro area | 6,628 27189.6465
Central city | 6,727 34445.35841
Outside central city | 11,639 43203.0348
Central city status unknown | 4,247 35557.95997
----------------------------------------------------------
* Right: among the cases selected here (sex==1, ages 30 to 64), "outside central city," or the suburbs, has the highest mean incwage, while rural "not in metro area" has the lowest.
. regress incwage metro if age>29 & age<65 & sex==1
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 1, 29333) = 400.17
Model | 6.0248e+11 1 6.0248e+11 Prob > F = 0.0000
Residual | 4.4162e+13 29333 1.5055e+09 R-squared = 0.0135
-------------+------------------------------ Adj R-squared = 0.0134
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38801
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro | 4512.636 225.5831 20.00 0.000 4070.483 4954.789
_cons | 25359.28 598.1346 42.40 0.000 24186.91 26531.65
------------------------------------------------------------------------------
* Don't ever do this except by accident. Here we put metro into our regression as a numeric variable, which makes no sense because the numbers of metro are just placeholders for the different categories. The codes could have been assigned in any order, so the slope on metro is meaningless. If you can't think of what the units of a variable are, it probably should not be treated as a numeric variable in regression. So what do we need? We need dummy variables.
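* To make the idea of a dummy variable concrete, here is a minimal sketch of building them by hand, assuming the class dataset is in memory. The names m_ and central are hypothetical, chosen only for illustration.

```stata
* One 0/1 variable per category of metro: m_1 ... m_5,
* numbered in sorted order of the codes (so m_1 flags metro==0)
tabulate metro, generate(m_)

* Or a single hand-built indicator: 1 for central city, 0 otherwise
generate byte central = (metro == 2) if !missing(metro)

* Check the coding against the original variable
tabulate metro central
```

* Leaving one of these indicators out of the regression is what makes that category the comparison group.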
. xi i.metro
i.metro _Imetro_0-4 (naturally coded; _Imetro_0 omitted)
* metro has 5 levels, so Stata generated 4 dummy variables and omitted the first category, metro==0, or "not identifiable." Each indicator variable is coded 1 for its own category and 0 for all the other categories. xi is built into Stata, and it works in both Stata ver 10 and Stata ver 11. When you run the xi command, you will see a new variable show up in your variable list for each new dummy variable.
. table metro, contents(mean _Imetro_1 mean _Imetro_2 mean _Imetro_3 mean _Imetro_4)
----------------------------------------------------------------------------------------
Metropolitan central city |
status | __000002 __000003 __000004 __000005
----------------------------+-----------------------------------------------------------
Not identifiable | 0 0 0 0
Not in metro area | 1 0 0 0
Central city | 0 1 0 0
Outside central city | 0 0 1 0
Central city status unknown | 0 0 0 1
----------------------------------------------------------------------------------------
. char metro[omit] 1
* Unfortunately with xi, you need a separate command to change the omitted category for variable metro. In this case we are changing the comparison category to metro==1, which are the rural folks.
. xi i.metro
i.metro _Imetro_0-4 (naturally coded; _Imetro_1 omitted)
. table metro, contents(mean _Imetro_0 mean _Imetro_2 mean _Imetro_3 mean _Imetro_4)
----------------------------------------------------------------------------------------
Metropolitan central city |
status | __000002 __000003 __000004 __000005
----------------------------+-----------------------------------------------------------
Not identifiable | 1 0 0 0
Not in metro area | 0 0 0 0
Central city | 0 1 0 0
Outside central city | 0 0 1 0
Central city status unknown | 0 0 0 1
----------------------------------------------------------------------------------------
* Because we made metro==1, "not in metro area," the omitted category above with the char command, we now get indicator variables for every category but that one.
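* The omit setting stays attached to metro until you change it. A small sketch of how to inspect it and, if you want xi's default back later, how to clear it. Running the last line would undo the char command above, so it is shown here only for reference.

```stata
* Show the characteristics currently stored on metro
char list metro[]

* Clearing the omit characteristic returns xi to its default
* behavior of omitting the lowest category (metro==0)
char metro[omit]
```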
. table metro if age>29 & age<65 & sex==1, contents(freq mean incwage)
----------------------------------------------------------
Metropolitan central city |
status | Freq. mean(incwage)
----------------------------+-----------------------------
Not identifiable | 94 31743.04255
Not in metro area | 6,628 27189.6465
Central city | 6,727 34445.35841
Outside central city | 11,639 43203.0348
Central city status unknown | 4,247 35557.95997
----------------------------------------------------------
. regress incwage _Imetro* if metro~=0 & age>29 & age<65 & sex==1
note: _Imetro_0 omitted because of collinearity
Source | SS df MS Number of obs = 29241
-------------+------------------------------ F( 3, 29237) = 252.70
Model | 1.1296e+12 3 3.7652e+11 Prob > F = 0.0000
Residual | 4.3563e+13 29237 1.4900e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0252
Total | 4.4692e+13 29240 1.5285e+09 Root MSE = 38600
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_0 | (omitted)
_Imetro_2 | 7255.712 668.0533 10.86 0.000 5946.297 8565.127
_Imetro_3 | 16013.39 593.9852 26.96 0.000 14849.15 17177.63
_Imetro_4 | 8368.313 758.7058 11.03 0.000 6881.216 9855.411
_cons | 27189.65 474.1327 57.35 0.000 26260.33 28118.97
------------------------------------------------------------------------------
* A few things to note about the regression. We have 4 terms, including the constant, predicting 4 category means (we dropped "not identifiable" from this analysis with metro~=0, which is why _Imetro_0 is omitted for collinearity). Having 4 terms predicting 4 things means we can fit the actual category means exactly. The constant equals the mean income for our excluded category, rural. Each other coefficient is the difference between that category's mean income and the rural mean income.
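* You can check this by hand with display, using the category means copied from the table output above. Each difference should agree with the corresponding coefficient up to rounding.

```stata
* central city mean minus rural mean: compare with _Imetro_2 (7255.712)
display 34445.35841 - 27189.6465

* outside central city mean minus rural mean: compare with _Imetro_3 (16013.39)
display 43203.0348 - 27189.6465

* status unknown mean minus rural mean: compare with _Imetro_4 (8368.313)
display 35557.95997 - 27189.6465
```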
. regress incwage _Imetro* if age>29 & age<65 & sex==1
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 190.17
Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000
Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0251
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_0 | 4553.396 4006.316 1.14 0.256 -3299.164 12405.96
_Imetro_2 | 7255.712 667.5305 10.87 0.000 5947.322 8564.102
_Imetro_3 | 16013.39 593.5204 26.98 0.000 14850.06 17176.71
_Imetro_4 | 8368.313 758.1121 11.04 0.000 6882.38 9854.247
_cons | 27189.65 473.7616 57.39 0.000 26261.05 28118.24
------------------------------------------------------------------------------
* If we put the unimportant metro==0 "not identifiable" folks back into the regression, what changes? The N goes up (29241 to 29335), the fit statistics change a little, and the constant and the other coefficients are unchanged. All of the standard errors and t statistics change a little, because the new cases change the pooled residual variance (Root MSE goes from 38600 to 38570) that every standard error is based on.
. lincom _Imetro_3- _Imetro_2
( 1) - _Imetro_2 + _Imetro_3 = 0
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 8757.676 590.7312 14.83 0.000 7599.817 9915.536
------------------------------------------------------------------------------
* If we want to make a comparison between two of the other groups, lincom is one way. It uses the results of the previous regression, so this is the coefficient we would get for metro==3 if we had made metro==2 the comparison category.
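* Another way to ask the same question, sketched here on the assumption that the regression on the _Imetro variables is still the most recent estimation, is the test command. It reports an F test of whether the two coefficients are equal rather than an estimate of their difference; with a single restriction that F should be the square of the lincom t statistic.

```stata
* Test whether the suburban and central city coefficients are equal
test _Imetro_3 = _Imetro_2
```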
. xi: regress incwage i.metro if age>29 & age<65 & sex==1
i.metro _Imetro_0-4 (naturally coded; _Imetro_1 omitted)
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 190.17
Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000
Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0251
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_0 | 4553.396 4006.316 1.14 0.256 -3299.164 12405.96
_Imetro_2 | 7255.712 667.5305 10.87 0.000 5947.322 8564.102
_Imetro_3 | 16013.39 593.5204 26.98 0.000 14850.06 17176.71
_Imetro_4 | 8368.313 758.1121 11.04 0.000 6882.38 9854.247
_cons | 27189.65 473.7616 57.39 0.000 26261.05 28118.24
------------------------------------------------------------------------------
* The xi: prefix combines the xi step and the regression step. Here we would put an i. in front of every categorical variable that needs dummy variables generated for it.
. regress incwage i.metro if age>29 & age<65 & sex==1
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 190.17
Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000
Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0251
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro |
1 | -4553.396 4006.316 -1.14 0.256 -12405.96 3299.164
2 | 2702.316 4005.904 0.67 0.500 -5149.436 10554.07
3 | 11459.99 3994.238 2.87 0.004 3631.107 19288.88
4 | 3814.917 4021.99 0.95 0.343 -4068.363 11698.2
|
_cons | 31743.04 3978.206 7.98 0.000 23945.58 39540.5
------------------------------------------------------------------------------
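* The above is a similar syntax, which Stata calls factor variables, but without the xi: prefix. This syntax is available only in Stata 11 or later. Note that factor variables ignore the char metro[omit] setting and omit the lowest category, metro==0, by default, so these coefficients are differences from "not identifiable" (the constant, 31743.04, is that group's mean).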
. regress incwage ib2.metro if age>29 & age<65 & sex==1
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 190.17
Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000
Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0251
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
metro |
0 | -2702.316 4005.904 -0.67 0.500 -10554.07 5149.436
1 | -7255.712 667.5305 -10.87 0.000 -8564.102 -5947.322
3 | 8757.676 590.7312 14.83 0.000 7599.817 9915.536
4 | 1112.602 755.9304 1.47 0.141 -369.0558 2594.259
|
_cons | 34445.36 470.2626 73.25 0.000 33523.62 35367.09
------------------------------------------------------------------------------
* Like desmat (see below), factor variables (above; search fvvarlist in the Stata help) allow you to set the omitted category on the fly, which is nice: ib2.metro makes metro==2 the comparison category. Factor variables (unlike xi and unlike desmat) do not create a new set of variables in your variable list that you can manipulate later. Commands such as lincom have to refer to the coefficients in factor notation (for example 3.metro) instead.
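* A minimal sketch of that factor notation, assuming Stata 11 or later and the class dataset in memory. The comparison is the same suburbs versus central city contrast as before, so it should reproduce the 8757.676 estimate.

```stata
* Fit the model with factor variables (base category is metro==0)
regress incwage i.metro if age>29 & age<65 & sex==1

* Difference between the metro==3 and metro==2 coefficients,
* written in factor notation because no _Imetro variables exist
lincom 3.metro - 2.metro
```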
. desmat: regress incwage metro=ind(2) if age>29 & age<65 & sex==1
--------------------------------------------------------------------------------------
Linear regression
--------------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 29335
F statistic: 190.172
Model degrees of freedom: 4
Residual degrees of freedom: 29330
R-squared: 0.025
Adjusted R-squared: 0.025
Root MSE 38570.134
Prob: 0.000
--------------------------------------------------------------------------------------
nr Effect Coeff s.e.
--------------------------------------------------------------------------------------
metro
1 Not identifiable 4553.396 4006.316
2 Central city 7255.712** 667.531
3 Outside central city 16013.388** 593.520
4 Central city status unknown 8368.313** 758.112
5 _cons 27189.646** 473.762
--------------------------------------------------------------------------------------
* p < .05
** p < .01
* Type findit desmat and follow the links to download. desmat assumes that predictor variables are always categorical and need to be made into dummies, unless you use the @ prefix to indicate that the predictor is continuous, so there is no i. prefix for the categorical metro. The ind(2) means that desmat will take the second category (which in this case is metro==1) and make it the omitted category.
. codebook metro
--------------------------------------------------------------------------------------
metro Metropolitan central city status
--------------------------------------------------------------------------------------
type: numeric (byte)
label: metrolbl
range: [0,4] units: 1
unique values: 5 missing .: 0/133710
tabulation: Freq. Numeric Label
340 0 Not identifiable
29658 1 Not in metro area
32481 2 Central city
51468 3 Outside central city
19763 4 Central city status unknown
. log close
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_pr
> oj3\fall_2010_s381_logs\class6.log
log type: text
closed on: 7 Oct 2010, 15:33:05
-------------------------------------------------------------------------------------------------