-----------------------------------------------------------------------------------------------------

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log

  log type:  text

 opened on:  11 Feb 2010, 14:04:58

 

. *ssc install desmat,replace

 

. *The above command loads the free add-on desmat into your Stata installation. Run the command without the asterisk. Desmat is a free add-on that serves as an alternative to xi, Stata's built-in facility for dealing with dummy variables. I like desmat better: I think it is more customizable and its output is easier to read. Install desmat on your machine.

 

* You can use the i. notation in front of categorical variables, without xi. The drawback is that Stata does not then create the dummy variables in your dataset, which makes post-regression estimation less convenient: in a command such as lincom you have to refer to each coefficient by its factor-variable name (for instance 2.metro) rather than by the name of a dummy variable.

 

. regress inctot i.metro

 

      Source |       SS       df       MS              Number of obs =  103226

-------------+------------------------------           F(  4,103221) =  260.60

       Model |  1.0608e+12     4  2.6521e+11           Prob > F      =  0.0000

    Residual |  1.0505e+14103221  1.0177e+09           R-squared     =  0.0100

-------------+------------------------------           Adj R-squared =  0.0100

       Total |  1.0611e+14103225  1.0279e+09           Root MSE      =   31901

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

       metro |

          1  |  -7505.233    1959.96    -3.83   0.000    -11346.73   -3663.736

          2  |  -3513.946   1959.129    -1.79   0.073    -7353.813    325.9204

          3  |   856.9903   1955.278     0.44   0.661     -2975.33     4689.31

          4  |  -3436.875   1965.638    -1.75   0.080    -7289.498    415.7494

             |

       _cons |   28722.62    1948.69    14.74   0.000     24903.21    32542.03

------------------------------------------------------------------------------

 

. regress inctot ib1.metro

 

      Source |       SS       df       MS              Number of obs =  103226

-------------+------------------------------           F(  4,103221) =  260.60

       Model |  1.0608e+12     4  2.6521e+11           Prob > F      =  0.0000

    Residual |  1.0505e+14103221  1.0177e+09           R-squared     =  0.0100

-------------+------------------------------           Adj R-squared =  0.0100

       Total |  1.0611e+14103225  1.0279e+09           Root MSE      =   31901

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

       metro |

          0  |   7505.233    1959.96     3.83   0.000     3663.736    11346.73

          2  |   3991.287   291.2823    13.70   0.000     3420.377    4562.196

          3  |   8362.223   264.1467    31.66   0.000     7844.499    8879.947

          4  |   4068.359   332.2516    12.24   0.000      3417.15    4719.567

             |

       _cons |   21217.39   209.8869   101.09   0.000     20806.01    21628.76

------------------------------------------------------------------------------

 

*If you are using the i. notation without xi preceding it, you can change the base value (that is, the comparison category) by writing ib#. In the above case we specified metro==1 as the comparison category, so it is excluded from the output.
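
* A minimal sketch of a post-estimation contrast written directly in factor-variable notation (Stata 11 or later). It assumes cps_mar_2000_new.dta is in memory, as elsewhere in this log. No output is shown because these commands were not part of the original session.

```stata
* Rural (metro==1) is the base category.
regress inctot ib1.metro

* Suburban (metro==3) minus central city (metro==2),
* naming the coefficients by level rather than by dummy variable.
lincom 3.metro - 2.metro
```

* The contrast should match the suburban and central city coefficients subtracted by hand.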

 

. exit, clear

---------------------------------------------------------------------------------------------------------------------------------------------------

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log

  log type:  text

 opened on:  11 Feb 2010, 14:27:26

 

. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear

 

. table metro, contents(freq mean inctot)

 

--------------------------------------------------------

Metropolitan central city   |

status                      |        Freq.  mean(inctot)

----------------------------+---------------------------

           Not identifiable |          340    28722.6194

          Not in metro area |       29,658   21217.38633

               Central city |       32,481    25208.6732

       Outside central city |       51,468   29579.60967

Central city status unknown |       19,763   25285.74487

--------------------------------------------------------

 

*This table shows the average income for each metro status. Note the values carefully. Whichever category is excluded from the regression (the comparison category), its average is what the constant is going to be.

 

. display 25208-21217

3991

 

* The urban-rural difference is $3991, and this will be reflected in the coefficients of the model if either urban or rural is the excluded category. If some other category is the excluded comparison category, we can recover the urban-rural difference with lincom.

 

. regress inctot i.metro

 

      Source |       SS       df       MS              Number of obs =  103226

-------------+------------------------------           F(  4,103221) =  260.60

       Model |  1.0608e+12     4  2.6521e+11           Prob > F      =  0.0000

    Residual |  1.0505e+14103221  1.0177e+09           R-squared     =  0.0100

-------------+------------------------------           Adj R-squared =  0.0100

       Total |  1.0611e+14103225  1.0279e+09           Root MSE      =   31901

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

       metro |

          1  |  -7505.233    1959.96    -3.83   0.000    -11346.73   -3663.736

          2  |  -3513.946   1959.129    -1.79   0.073    -7353.813    325.9204

          3  |   856.9903   1955.278     0.44   0.661     -2975.33     4689.31

          4  |  -3436.875   1965.638    -1.75   0.080    -7289.498    415.7494

             |

       _cons |   28722.62    1948.69    14.74   0.000     24903.21    32542.03

------------------------------------------------------------------------------

 

*Here, the automatic excluded category is the first category, "Not identifiable", so the constant is 28722.62, the "Not identifiable" average, and every other category is compared to that one.
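
* A quick check, using the means from the table command earlier in this log: each coefficient is that category's mean minus the mean of the excluded category (the constant). The results should reproduce the coefficients on metro 1 and metro 2 (-7505.233 and -3513.946) up to rounding.

```stata
* Rural mean minus the "Not identifiable" mean
display 21217.38633 - 28722.6194

* Central city mean minus the "Not identifiable" mean
display 25208.6732 - 28722.6194
```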

 

. *ssc install desmat, replace

 

. desmat inctot metro

--Break--

r(1);

 

. desmat: inctot metro

, invalid name

r(198);

 

. desmat: regress inctot metro

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                     inctot

   Number of observations:                                                103226

   F statistic:                                                          260.597

   Model degrees of freedom:                                                   4

   Residual degrees of freedom:                                           103221

   R-squared:                                                              0.010

   Adjusted R-squared:                                                     0.010

   Root MSE                                                            31901.428

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   metro

1    Not in metro area                                     -7505.233**  1959.960

2    Central city                                          -3513.946    1959.129

3    Outside central city                                    856.990    1955.278

4    Central city status unknown                           -3436.875    1965.638

5  _cons                                                   28722.619**  1948.690

---------------------------------------------------------------------------------

*  p < .05

** p < .01

 

* I think it makes more sense to exclude the small group of "Not identifiable" respondents from the analysis. Now we are going to compare everyone to the second category, the rural folks.

 

. desmat: regress inctot metro=ind(2) if metro!=0

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                     inctot

   Number of observations:                                                102958

   F statistic:                                                          346.821

   Model degrees of freedom:                                                   3

   Residual degrees of freedom:                                           102954

   R-squared:                                                              0.010

   Adjusted R-squared:                                                     0.010

   Root MSE                                                            31901.199

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   metro

1    Not identifiable                                          0.000           .

2    Central city                                           3991.287**   291.280

3    Outside central city                                   8362.223**   264.145

4    Central city status unknown                            4068.359**   332.249

5  _cons                                                   21217.386**   209.885

---------------------------------------------------------------------------------

*  p < .05

** p < .01

 

* A quick definition of what things mean in the regression output. You should already have some understanding of the coefficients, their standard errors, and the resulting T-statistics or Z values.

 

* Number of observations, 102958, is just the number of cases that have a value for inctot and have metro!=0.

 

* The F statistic is a test (which we won't discuss or make use of in this class) which compares how well this model fits compared to a model that has only the constant term in it. The constant-only model is a silly model (it assumes all respondents have basically the same inctot). Every reasonable model should fit the data better than the constant-only model.

 

* Model degrees of freedom tells you how many terms are in the model in addition to the constant term. There are three terms in the model, so model df is 3.

 

* Residual degrees of freedom is the number of observations minus the model df minus 1 (for the constant): 102958 - 3 - 1 = 102954.

 

* While we will not be making use of the F-tests, we will be looking at the R-square and adjusted R-square as measures of model fit. The R-square tells us what proportion of the variance of inctot (across all 100K respondents) is explained by our predictor variables, in this case metro. The answer is 1% (R-square=0.01). Models with higher R-square (closer to 1) fit better, and will be preferred. The adjusted R-square is like regular R-square, but makes a slight adjustment to penalize the R-square value depending on how many terms you put in the model. If you put a lot of useless terms in the model, R-square will barely change (it can never go down) but adjusted R-square will go down.
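
* A sketch of how the degrees of freedom and the adjusted R-square can be rebuilt from the results that regress stores, assuming cps_mar_2000_new.dta is in memory. The formula used here is the standard one, adjusted R-square = 1 - (1 - R-square)*(N - 1)/(residual df). It is written with regress rather than desmat so that it runs without the add-on.

```stata
quietly regress inctot ib1.metro if metro!=0

* Residual df by hand, then as stored by regress
display e(N) - e(df_m) - 1
display e(df_r)

* Adjusted R-square by hand, then as stored by regress
display 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
display e(r2_a)
```

* Each pair of lines should print the same number.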

 

* In the above regression, we had rural as the comparison category. We can see the central city- rural contrast of 3991, which has a standard error of 291, and therefore a T-statistic of more than 10. We can calculate the T-statistic by hand:

 

 

. display 3991.287/291.280

13.702578

 

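* Or we can ask desmat's little brother desrep to give us the regression output that also includes the T-statistic (which is activated by option zval) and the p-value (option prob). The p-value is the probability of seeing a difference at least this large if the null hypothesis (that the two groups have the same average income) were true. It is not the probability that the null hypothesis is true. With a T-statistic this large the p-value is essentially zero, so we can reject with a great deal of confidence the null hypothesis that rural workers and city workers earn the same amount. Clearly city workers earn more.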

 

. desrep, zval prob

------------------------------------------------------------------------------------------

   Linear regression

------------------------------------------------------------------------------------------

   Dependent variable                                                              inctot

   Number of observations:                                                         102958

   F statistic:                                                                   346.821

   Model degrees of freedom:                                                            3

   Residual degrees of freedom:                                                    102954

   R-squared:                                                                       0.010

   Adjusted R-squared:                                                              0.010

   Root MSE                                                                     31901.199

   Prob:                                                                            0.000

------------------------------------------------------------------------------------------

nr Effect                                           Coeff        s.e.       t        prob

------------------------------------------------------------------------------------------

   metro

1    Not identifiable                               0.000           .         .         .

2    Central city                                3991.287**   291.280    13.703     0.000

3    Outside central city                        8362.223**   264.145    31.658     0.000

4    Central city status unknown                 4068.359**   332.249    12.245     0.000

5  _cons                                        21217.386**   209.885   101.090     0.000

------------------------------------------------------------------------------------------

*  p < .05

** p < .01
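
* A sketch of where the prob column comes from, using Stata's built-in ttail() and invttail() functions with the T-statistic and residual degrees of freedom reported above for central city. The two-sided p-value is twice the upper-tail probability.

```stata
* Two-sided p-value for t = 13.703 with 102954 residual degrees of freedom
display 2 * ttail(102954, 13.703)

* The critical value of t for a two-sided test at the .05 level
display invttail(102954, 0.025)
```

* The first number is far smaller than .001, which is why desrep rounds it to 0.000.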

 

.

 

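* Now let's say we want to compare the suburbs (i.e. "outside central city") to the central city. We don't need to run the regression again; we just use lincom to compare them. The dummy variables are named _x_ plus the row number (nr) in the desmat output, so here _x_2 is central city and _x_3 is outside central city.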

. lincom  _x_3- _x_2

 

 ( 1)  - _x_2 + _x_3 = 0

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

         (1) |   4370.936   257.9009    16.95   0.000     3865.454    4876.419

------------------------------------------------------------------------------

 

 

*If we want to see the constant-only model, which really makes no sense but is used as an implicit comparison by the F-test, here it is. Its constant, 26011.4, is simply the average inctot across all respondents.

 

. regress inctot

 

      Source |       SS       df       MS              Number of obs =  103226

-------------+------------------------------           F(  0,103225) =    0.00

       Model |           0     0           .           Prob > F      =       .

    Residual |  1.0611e+14103225  1.0279e+09           R-squared     =  0.0000

-------------+------------------------------           Adj R-squared =  0.0000

       Total |  1.0611e+14103225  1.0279e+09           Root MSE      =   32061

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

       _cons |    26011.4   99.79046   260.66   0.000     25815.81    26206.99

------------------------------------------------------------------------------

 

. desmat: regress inctot metro=ind(2) if metro!=0

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                     inctot

   Number of observations:                                                102958

   F statistic:                                                          346.821

   Model degrees of freedom:                                                   3

   Residual degrees of freedom:                                           102954

   R-squared:                                                              0.010

   Adjusted R-squared:                                                     0.010

   Root MSE                                                            31901.199

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   metro

1    Not identifiable                                          0.000           .

2    Central city                                           3991.287**   291.280

3    Outside central city                                   8362.223**   264.145

4    Central city status unknown                            4068.359**   332.249

5  _cons                                                   21217.386**   209.885

---------------------------------------------------------------------------------

*  p < .05

** p < .01

 

*What I want to show here is that the model fit statistics are the same regardless of which category of metro is the excluded comparison category. Above we exclude the 2nd category, which is rural. Below we exclude the third category, which is central city. In the above comparison we have urban-rural=3991. In the below comparison we have rural-urban=-3991. And note that all the model fit and summary statistics are the same. It is the same model.

 

 

. desmat: regress inctot metro=ind(3) if metro!=0

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                     inctot

   Number of observations:                                                102958

   F statistic:                                                          346.821

   Model degrees of freedom:                                                   3

   Residual degrees of freedom:                                           102954

   R-squared:                                                              0.010

   Adjusted R-squared:                                                     0.010

   Root MSE                                                            31901.199

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   metro

1    Not identifiable                                          0.000           .

2    Not in metro area                                     -3991.287**   291.280

3    Outside central city                                   4370.936**   257.901

4    Central city status unknown                              77.072     327.307

5  _cons                                                   25208.673**   201.971

---------------------------------------------------------------------------------

*  p < .05

** p < .01
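
* A sketch of the same point using regress and ib#, assuming cps_mar_2000_new.dta is in memory. The scalar names r2_rural and rmse_rural are made up for this illustration. If changing the base category changed the model, the assert commands would stop with an error.

```stata
* Rural (metro==1) as the comparison category
quietly regress inctot ib1.metro if metro!=0
scalar r2_rural = e(r2)
scalar rmse_rural = e(rmse)

* Central city (metro==2) as the comparison category
quietly regress inctot ib2.metro if metro!=0

* Same fit statistics either way
assert reldif(e(r2), r2_rural) < 1e-8
assert reldif(e(rmse), rmse_rural) < 1e-8
```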

 

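* And if we want to compare suburban to rural, we do lincom. With central city as the excluded category, _x_2 is now rural ("Not in metro area") and _x_3 is suburban ("Outside central city").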

. lincom  _x_3 - _x_2

 

 ( 1)  - _x_2 + _x_3 = 0

 

------------------------------------------------------------------------------

      inctot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

         (1) |   8362.223   264.1448    31.66   0.000     7844.503    8879.944

------------------------------------------------------------------------------

 

. codebook metro

 

---------------------------------------------------------------------------------

metro                                            Metropolitan central city status

---------------------------------------------------------------------------------

 

                  type:  numeric (byte)

                 label:  metrolbl

 

                 range:  [0,4]                        units:  1

         unique values:  5                        missing .:  0/133710

 

            tabulation:  Freq.   Numeric  Label

                           340         0  Not identifiable

                         29658         1  Not in metro area

                         32481         2  Central city

                         51468         3  Outside central city

                         19763         4  Central city status unknown

 

* So why do we do dummy variables at all? Well, because metro is a categorical variable whose numbers don't mean anything. If we treat metro like a continuous variable (in desmat language, we put an @ in front of it), we get a regression, but it is totally nonsensical. There is no such thing as "units" of metro. The results don't make any sense:

 

. desmat: regress inctot @metro

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                     inctot

   Number of observations:                                                103226

   F statistic:                                                          489.217

   Model degrees of freedom:                                                   1

   Residual degrees of freedom:                                           103224

   R-squared:                                                              0.005

   Adjusted R-squared:                                                     0.005

   Root MSE                                                            31985.930

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

1  Metropolitan central city status                         2193.168**    99.156

2  _cons                                                   20634.754**   262.683

---------------------------------------------------------------------------------

*  p < .05

** p < .01
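
* To see why this is nonsensical, a sketch of what the continuous model predicts for each code of metro, using the two coefficients reported above. Compare the results with the actual averages from the table command earlier in this log (21217 rural, 25209 central city, 29580 suburban, 25286 status unknown).

```stata
* Predicted inctot = constant + slope * (numeric code of metro)
display 20634.754 + 2193.168*1
display 20634.754 + 2193.168*2
display 20634.754 + 2193.168*3
display 20634.754 + 2193.168*4
```

* The model forces the same step of 2193 between every pair of adjacent codes, and it predicts the highest income for code 4 ("Central city status unknown") only because 4 happens to be the largest number.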

 

. *look at the dummy variables created by desmat. They are all 0-1 indicator variables.

 

. desmat metro

 

Desmat generated the following design matrix:

 

nr   Variables       Term                        Parameterization

     First    Last

 

 1    _x_1    _x_4   metro                       ind(0)

 

. tabulate metro  _x_2

 

 Metropolitan central |       metro==2

          city status |         0          1 |     Total

----------------------+----------------------+----------

     Not identifiable |       340          0 |       340

    Not in metro area |    29,658          0 |    29,658

         Central city |         0     32,481 |    32,481

 Outside central city |    51,468          0 |    51,468

Central city status u |    19,763          0 |    19,763

----------------------+----------------------+----------

                Total |   101,229     32,481 |   133,710
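
* A sketch of a stricter check than eyeballing the table, assuming desmat is installed and the data are in memory. The assert command stops with an error if _x_2 is anything other than the 0-1 indicator for central city (metro==2).

```stata
desmat metro

* _x_2 should be 1 for central city and 0 for everyone else
assert _x_2 == (metro == 2)

* Every dummy should have a minimum of 0 and a maximum of 1
summarize _x_1 _x_2 _x_3 _x_4
```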

 

 

. log close

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\2010_logs\sixth_class.log

  log type:  text

 closed on:  11 Feb 2010, 15:19:01

------------------------------------------------------------------------------------------------------------