-----------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log
log type: text
opened on: 11 Feb 2010, 14:04:58
. *ssc install desmat,replace
. *The command above installs desmat, a free add-on for Stata. Run it without the leading asterisk. Desmat is an alternative to xi, Stata's built-in facility for dealing with dummy variables. I like desmat better: I think it is more customizable and produces more readable output. Install desmat on your machine.
* You can use the i. notation in front of categorical variables without xi, but then Stata does not create the dummy variables in your dataset, which makes post-regression estimation less convenient. It is easiest to run lincom, for instance, when the dummy variables exist.
. regress inctot i.metro
      Source |       SS         df       MS            Number of obs =  103226
-------------+--------------------------------         F(  4,103221) =  260.60
       Model |  1.0608e+12       4  2.6521e+11         Prob > F      =  0.0000
    Residual |  1.0505e+14  103221  1.0177e+09         R-squared     =  0.0100
-------------+--------------------------------         Adj R-squared =  0.0100
       Total |  1.0611e+14  103225  1.0279e+09         Root MSE      =   31901
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       metro |
          1  |  -7505.233    1959.96    -3.83   0.000    -11346.73   -3663.736
          2  |  -3513.946   1959.129    -1.79   0.073    -7353.813    325.9204
          3  |   856.9903   1955.278     0.44   0.661     -2975.33     4689.31
          4  |  -3436.875   1965.638    -1.75   0.080    -7289.498    415.7494
             |
       _cons |   28722.62    1948.69    14.74   0.000     24903.21    32542.03
------------------------------------------------------------------------------
. regress inctot ib1.metro
      Source |       SS         df       MS            Number of obs =  103226
-------------+--------------------------------         F(  4,103221) =  260.60
       Model |  1.0608e+12       4  2.6521e+11         Prob > F      =  0.0000
    Residual |  1.0505e+14  103221  1.0177e+09         R-squared     =  0.0100
-------------+--------------------------------         Adj R-squared =  0.0100
       Total |  1.0611e+14  103225  1.0279e+09         Root MSE      =   31901
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       metro |
          0  |   7505.233    1959.96     3.83   0.000     3663.736    11346.73
          2  |   3991.287   291.2823    13.70   0.000     3420.377    4562.196
          3  |   8362.223   264.1467    31.66   0.000     7844.499    8879.947
          4  |   4068.359   332.2516    12.24   0.000      3417.15    4719.567
             |
       _cons |   21217.39   209.8869   101.09   0.000     20806.01    21628.76
------------------------------------------------------------------------------
*If you are using the i. notation without xi in front of it, you can change the base value (that is, the comparison category) by writing ib#. In the regression above, we specified metro==1 as the comparison category, so it is excluded from the output.
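* A minimal sketch of other ways to pick the comparison category with factor-variable notation, assuming Stata 11 or later. These commands were not run in this log, so no output is shown.

```stata
* Use metro==2 (central city) as the comparison category
regress inctot ib2.metro

* Use the last category, or the most frequent category, as the comparison
regress inctot ib(last).metro
regress inctot ib(freq).metro

* Set the base once, so that a plain i.metro uses metro==1 from then on
fvset base 1 metro
regress inctot i.metro

* Contrast two categories by their factor-variable names
* (here: suburbs minus central city)
lincom 3.metro - 2.metro
```

* Whichever base you choose, the fit of the model is the same. Only the category that the coefficients are measured against changes.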
. exit, clear
---------------------------------------------------------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log
log type: text
opened on: 11 Feb 2010, 14:27:26
. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear
. table metro, contents(freq mean inctot)
--------------------------------------------------------
Metropolitan central city   |
status                      |      Freq.  mean(inctot)
----------------------------+---------------------------
           Not identifiable |        340    28722.6194
          Not in metro area |     29,658   21217.38633
               Central city |     32,481    25208.6732
       Outside central city |     51,468   29579.60967
Central city status unknown |     19,763   25285.74487
--------------------------------------------------------
*This table shows the average income for each metro status. Note the values carefully. The constant in the regression will equal the mean of whichever category is excluded (the comparison category).
. display 25208-21217
3991
* The urban-rural difference (central city minus not in metro area) is $3991. It will appear directly as a coefficient in the model if either urban or rural is the excluded category. Otherwise we can recover the urban-rural difference with lincom.
. regress inctot i.metro
      Source |       SS         df       MS            Number of obs =  103226
-------------+--------------------------------         F(  4,103221) =  260.60
       Model |  1.0608e+12       4  2.6521e+11         Prob > F      =  0.0000
    Residual |  1.0505e+14  103221  1.0177e+09         R-squared     =  0.0100
-------------+--------------------------------         Adj R-squared =  0.0100
       Total |  1.0611e+14  103225  1.0279e+09         Root MSE      =   31901
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       metro |
          1  |  -7505.233    1959.96    -3.83   0.000    -11346.73   -3663.736
          2  |  -3513.946   1959.129    -1.79   0.073    -7353.813    325.9204
          3  |   856.9903   1955.278     0.44   0.661     -2975.33     4689.31
          4  |  -3436.875   1965.638    -1.75   0.080    -7289.498    415.7494
             |
       _cons |   28722.62    1948.69    14.74   0.000     24903.21    32542.03
------------------------------------------------------------------------------
*Here, the automatic excluded category is the first category, "Not identifiable", so the constant is 28722.62, the "Not identifiable" average, and every other category is compared to that one.
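* The link between the table of means and the coefficients can be checked by hand. A small sketch, with the means typed in from the table above, so the results match the coefficients only up to rounding:

```stata
* Each coefficient is that category's mean minus the "Not identifiable" mean
* Not in metro area (metro==1)
display 21217.38633 - 28722.6194
* Central city (metro==2)
display 25208.6732 - 28722.6194
* Outside central city (metro==3)
display 29579.60967 - 28722.6194
* Central city status unknown (metro==4)
display 25285.74487 - 28722.6194
```

* The four differences should come out close to the coefficients -7505.233, -3513.946, 856.9903 and -3436.875 in the regression above.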
. *ssc install desmat, replace
. desmat inctot metro
--Break--
r(1);
. desmat: inctot metro
, invalid name
r(198);
. desmat: regress inctot metro
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 103226
F statistic: 260.597
Model degrees of freedom: 4
Residual degrees of freedom: 103221
R-squared: 0.010
Adjusted R-squared: 0.010
Root MSE 31901.428
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
metro
1 Not in metro area -7505.233** 1959.960
2 Central city -3513.946 1959.129
3 Outside central city 856.990 1955.278
4 Central city status unknown -3436.875 1965.638
5 _cons 28722.619** 1948.690
---------------------------------------------------------------------------------
* p < .05
** p < .01
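* I think it makes more sense to exclude the small group of "Not identifiable" respondents from the analysis. Now we are going to compare everyone to the second category, the rural folks ("Not in metro area"). The "Not identifiable" row still appears in the output below, but its coefficient is 0 with no standard error because those cases have been dropped.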
. desmat: regress inctot metro=ind(2) if metro!=0
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 102958
F statistic: 346.821
Model degrees of freedom: 3
Residual degrees of freedom: 102954
R-squared: 0.010
Adjusted R-squared: 0.010
Root MSE 31901.199
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
metro
1 Not identifiable 0.000 .
2 Central city 3991.287** 291.280
3 Outside central city 8362.223** 264.145
4 Central city status unknown 4068.359** 332.249
5 _cons 21217.386** 209.885
---------------------------------------------------------------------------------
* p < .05
** p < .01
* A quick definition of what things mean in the regression output. You should already have some understanding of the coefficients, their standard errors, and the resulting T-statistics or Z values.
* Number of observations, 102958, is just the number of cases that have a value for inctot and have metro!=0.
* The F statistic is a test (which we won't discuss or make use of in this class) of how well this model fits compared to a model that has only the constant term in it. The constant-only model is a silly model (it predicts the same inctot for every respondent). Every reasonable model should fit the data better than the constant-only model.
* Model degrees of freedom tells you how many terms are in the model in addition to the constant term. There are three terms in the model (one for each metro category other than the comparison category), so model df is 3.
* Residual degrees of freedom is the number of observations minus the model df minus 1: 102958 - 3 - 1 = 102954.
* While we will not be making use of the F-tests, we will be looking at the R-square and adjusted R-square as measures of model fit. The R-square tells us what share of the variance of inctot (across all 100K respondents) is explained by our predictor variables, in this case metro. The answer is 1% (R-square=0.01). Models with higher R-square (closer to 1) fit better, and will be preferred. The adjusted R-square is like regular R-square, but applies a small penalty for each term you put in the model. If you put a lot of useless terms in the model, R-square will barely rise (it can never go down when terms are added), but adjusted R-square can go down.
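* The definitions above can be checked against the results Stata stores after a regression. A sketch only: it reruns the same model with plain regress rather than desmat (my substitution) and was not part of the original session.

```stata
* Rural (metro==1) as the comparison category, "Not identifiable" cases dropped
regress inctot ib1.metro if metro != 0

* Degrees of freedom: residual df = N - model df - 1
display e(N)
display e(df_m)
display e(N) - e(df_m) - 1
display e(df_r)

* R-square = model sum of squares / total sum of squares
display e(mss) / (e(mss) + e(rss))
display e(r2)

* Adjusted R-square applies a penalty for each term in the model
display 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
display e(r2_a)

* F statistic = (model SS / model df) / (residual SS / residual df)
display (e(mss) / e(df_m)) / (e(rss) / e(df_r))
display e(F)
```

* In each pair, the hand calculation and the stored result should agree.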
* In the above regression, we had rural as the comparison category. We can see the central city-rural contrast of 3991, which has a standard error of 291, and therefore a T-statistic of more than 10. We can calculate the T-statistic by hand:
. display 3991.287/291.280
13.702578
* Or we can ask desmat's little brother desrep to give us regression output that also includes the T-statistic (option zval) and the p-value (option prob). The p-value is the probability of seeing a difference at least this large if the null hypothesis (that rural workers and city workers have the same average income) were true. It is not the probability that the null hypothesis is true. Here the p-value is so small that we can confidently reject the null hypothesis. Clearly city workers earn more.
. desrep, zval prob
------------------------------------------------------------------------------------------
Linear regression
------------------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 102958
F statistic: 346.821
Model degrees of freedom: 3
Residual degrees of freedom: 102954
R-squared: 0.010
Adjusted R-squared: 0.010
Root MSE 31901.199
Prob: 0.000
------------------------------------------------------------------------------------------
nr Effect Coeff s.e. t prob
------------------------------------------------------------------------------------------
metro
1 Not identifiable 0.000 . . .
2 Central city 3991.287** 291.280 13.703 0.000
3 Outside central city 8362.223** 264.145 31.658 0.000
4 Central city status unknown 4068.359** 332.249 12.245 0.000
5 _cons 21217.386** 209.885 101.090 0.000
------------------------------------------------------------------------------------------
* p < .05
** p < .01
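* The t and prob columns can also be reproduced from the stored coefficients. A sketch, assuming the desmat regression above is still the active estimation result and that _x_2 is the central city dummy (as the lincom output further down suggests):

```stata
* T-statistic for the central city coefficient
display _b[_x_2] / _se[_x_2]

* Two-sided p-value from the t distribution with the residual df
display 2 * ttail(e(df_r), abs(_b[_x_2] / _se[_x_2]))

* 95% confidence interval for the central city-rural difference
display _b[_x_2] - invttail(e(df_r), 0.025) * _se[_x_2]
display _b[_x_2] + invttail(e(df_r), 0.025) * _se[_x_2]
```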
.
* Now let's say we want to compare the suburbs (i.e. "Outside central city") to the central city. We don't need to run the regression again; we just use lincom to compare them.
. lincom _x_3- _x_2
( 1) - _x_2 + _x_3 = 0
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   4370.936   257.9009    16.95   0.000     3865.454    4876.419
------------------------------------------------------------------------------
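* What lincom does can be sketched by hand. The point estimate is just the difference between the two coefficients, but the standard error also needs their covariance. This assumes the desmat regression with rural as the comparison category is still the active estimation result. The scalar names are my own.

```stata
* Point estimate: suburban coefficient minus central city coefficient
display 8362.223 - 3991.287

* Standard error: sqrt( Var(b3) + Var(b2) - 2*Cov(b3, b2) )
matrix V = e(V)
scalar v33 = el(V, rownumb(V, "_x_3"), colnumb(V, "_x_3"))
scalar v22 = el(V, rownumb(V, "_x_2"), colnumb(V, "_x_2"))
scalar c32 = el(V, rownumb(V, "_x_3"), colnumb(V, "_x_2"))
display sqrt(v33 + v22 - 2*c32)

* The same hypothesis as an F-test (F is the square of the T-statistic)
test _x_3 = _x_2
```

* The first display gives 4370.936, the lincom estimate. The square root should be close to the lincom standard error of 257.9.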
*If we want to see the constant-only model, which really makes no sense but is used as an implicit comparison by the F-test, here it is. Its constant is simply the overall mean of inctot.
. regress inctot
      Source |       SS         df       MS            Number of obs =  103226
-------------+--------------------------------         F(  0,103225) =    0.00
       Model |           0       0           .         Prob > F      =       .
    Residual |  1.0611e+14  103225  1.0279e+09         R-squared     =  0.0000
-------------+--------------------------------         Adj R-squared =  0.0000
       Total |  1.0611e+14  103225  1.0279e+09         Root MSE      =   32061
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |    26011.4   99.79046   260.66   0.000     25815.81    26206.99
------------------------------------------------------------------------------
. desmat: regress inctot metro=ind(2) if metro!=0
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 102958
F statistic: 346.821
Model degrees of freedom: 3
Residual degrees of freedom: 102954
R-squared: 0.010
Adjusted R-squared: 0.010
Root MSE 31901.199
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
metro
1 Not identifiable 0.000 .
2 Central city 3991.287** 291.280
3 Outside central city 8362.223** 264.145
4 Central city status unknown 4068.359** 332.249
5 _cons 21217.386** 209.885
---------------------------------------------------------------------------------
* p < .05
** p < .01
*What I want to show here is that the model fit statistics are the same regardless of which category of metro is the excluded comparison category. Above we exclude the 2nd category, which is rural. Below we exclude the third category, which is central city. Above, the central city coefficient is urban-rural=3991. Below, the rural coefficient is rural-urban=-3991. All the model fit and summary statistics are the same, because it is the same model.
. desmat: regress inctot metro=ind(3) if metro!=0
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 102958
F statistic: 346.821
Model degrees of freedom: 3
Residual degrees of freedom: 102954
R-squared: 0.010
Adjusted R-squared: 0.010
Root MSE 31901.199
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
metro
1 Not identifiable 0.000 .
2 Not in metro area -3991.287** 291.280
3 Outside central city 4370.936** 257.901
4 Central city status unknown 77.072 327.307
5 _cons 25208.673** 201.971
---------------------------------------------------------------------------------
* p < .05
** p < .01
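* And if we want to compare suburban to rural, we do lincom. In this model _x_3 is suburban and _x_2 is rural, each measured against central city, so their difference is the suburban-rural contrast.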
. lincom _x_3 - _x_2
( 1) - _x_2 + _x_3 = 0
------------------------------------------------------------------------------
      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   8362.223   264.1448    31.66   0.000     7844.503    8879.944
------------------------------------------------------------------------------
. codebook metro
---------------------------------------------------------------------------------
metro Metropolitan central city status
---------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  metrolbl
                 range:  [0,4]                        units:  1
         unique values:  5                        missing .:  0/133710
            tabulation:  Freq.   Numeric  Label
                           340         0  Not identifiable
                         29658         1  Not in metro area
                         32481         2  Central city
                         51468         3  Outside central city
                         19763         4  Central city status unknown
* So why do we use dummy variables at all? Because metro is a categorical variable whose numbers don't mean anything. If we treat metro like a continuous variable (in desmat language, we put an @ in front of it), we get a regression, but it is totally nonsensical. There is no such thing as "units" of metro. The results don't make any sense:
. desmat: regress inctot @metro
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable inctot
Number of observations: 103226
F statistic: 489.217
Model degrees of freedom: 1
Residual degrees of freedom: 103224
R-squared: 0.005
Adjusted R-squared: 0.005
Root MSE 31985.930
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
1 Metropolitan central city status 2193.168** 99.156
2 _cons 20634.754** 262.683
---------------------------------------------------------------------------------
* p < .05
** p < .01
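* To see why this is nonsense, work out what the continuous model predicts for each metro code and compare it with the actual means from the table earlier in the log. A small sketch using the constant and slope printed above:

```stata
* Predicted inctot = constant + slope * metro code
* metro==0, Not identifiable (actual mean 28722.62)
display 20634.754 + 2193.168 * 0
* metro==1, Not in metro area (actual mean 21217.39)
display 20634.754 + 2193.168 * 1
* metro==2, Central city (actual mean 25208.67)
display 20634.754 + 2193.168 * 2
* metro==3, Outside central city (actual mean 29579.61)
display 20634.754 + 2193.168 * 3
* metro==4, Central city status unknown (actual mean 25285.74)
display 20634.754 + 2193.168 * 4
```

* The straight line forces income to rise by the same 2193 with each step in the code, so it predicts the lowest income for "Not identifiable" and the highest for "Central city status unknown". Neither matches the actual means.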
. *Look at the dummy variables created by desmat. They are all 0-1 indicator variables.
. desmat metro
Desmat generated the following design matrix:
nr Variables Term Parameterization
First Last
1 _x_1 _x_4 metro ind(0)
. tabulate metro _x_2
 Metropolitan central |       metro==2
          city status |         0          1 |     Total
----------------------+----------------------+----------
     Not identifiable |       340          0 |       340
    Not in metro area |    29,658          0 |    29,658
         Central city |         0     32,481 |    32,481
 Outside central city |    51,468          0 |    51,468
Central city status u |    19,763          0 |    19,763
----------------------+----------------------+----------
                Total |   101,229     32,481 |   133,710
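* The same kind of 0-1 indicator can be built without desmat. A sketch with hypothetical variable names (central and the m_ stub are my own, not part of the original session):

```stata
* One dummy by hand: 1 for central city, 0 for everyone else
generate byte central = (metro == 2)
tabulate metro central

* Or one dummy per category in a single step: m_1 through m_5
tabulate metro, generate(m_)
```

* The logical expression works cleanly here because metro has no missing values, as the codebook output shows.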
. log close
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log
log type: text
closed on: 11 Feb 2010, 15:19:01
------------------------------------------------------------------------------------------------------------