--------------------------------------------------------------------------------------------------

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_pro

> j3\2011_180B_logs\class10.log

  log type:  text

 opened on:  24 Feb 2011, 13:26:09

 

* First we load the 50 state dataset.

 

. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fifty_state_dataset.dta", clear

 

. describe

 

* Note that the data has 51 observations, 50 states plus DC.

 

Contains data from C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fifty_state_dataset.dta

  obs:            51                         

 vars:            11                          7 Nov 2010 14:16

 size:         2,703 (99.9% of memory free)

---------------------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

---------------------------------------------------------------------------------------

statefip        byte   %57.0g      statefiplbl

                                              State (FIPS code)

US_born_propo~n float  %8.0g                  mean(US_b~n)

seniors_propo~n float  %8.0g                  mean(ove~65)

children_prop~n float  %8.0g                  mean(chil~n)

NH_White_prop~n float  %8.0g                  mean(NH_w~e)

inctot          double %12.0g                 mean(inctot)

CPS_population  long   %9.0gc                 Freq.

incwage         double %12.0g                 mean(incw~e)

male_proportion float  %8.0g                  mean(male)

urban_proport~n float  %8.0g                  mean(urban)

yrsed           float  %9.0g                  mean(yrsed)

---------------------------------------------------------------------------------------

Sorted by: 

 

. twoway (scatter incwage   US_born_proportion, mlabel(statefip)) (lfit incwage   US_born_proportion)

* If you weren't in class you really must plot these scatter plots and look at the line to understand which states are outliers and why.

 

. regress incwage   US_born_proportion

 

      Source |       SS       df       MS              Number of obs =      51

-------------+------------------------------           F(  1,    49) =   14.84

       Model |   104897990     1   104897990           Prob > F      =  0.0003

    Residual |   346387525    49  7069133.17           R-squared     =  0.2324

-------------+------------------------------           Adj R-squared =  0.2168

       Total |   451285515    50   9025710.3           Root MSE      =  2658.8

 

------------------------------------------------------------------------------

     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

US_born_pr~n |  -24075.79       6250    -3.85   0.000    -36635.63   -11515.94

       _cons |   41551.87   5790.425     7.18   0.000     29915.58    53188.17

------------------------------------------------------------------------------

 

* The regress command produces the line that lfit plotted in the scatter plot above.

 

. predict newer_M1_predicted

(option xb assumed; fitted values)

* After regression we can generate predicted values.
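* As a rough sketch of what predict is doing, the fitted values can also be built by hand from the stored coefficients. This assumes the fifty state dataset is still in memory; the variable name m1_fitted_byhand is made up for this illustration and is not part of the class log.

```stata
* Refit the model so that _b[] holds the coefficients from this regression.
quietly regress incwage US_born_proportion

* Fitted value = constant + slope * X (m1_fitted_byhand is a hypothetical name).
generate double m1_fitted_byhand = _b[_cons] + _b[US_born_proportion]*US_born_proportion

* The two variables should agree up to rounding.
summarize newer_M1_predicted m1_fitted_byhand
```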

 

. gen newer_M1_residuals=incwage- newer_M1_predicted

* Stata can generate the residuals directly (predict with the residuals option), but in this case I wanted to show that the residuals are easily obtained as Actual minus Predicted values.
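* For comparison, here is a minimal sketch of the direct route using predict with the residuals option. It assumes the regression and the gen command above have been run; m1_resid_direct and m1_resid_gap are hypothetical variable names used only for this check.

```stata
* Refit so that predict works off this regression.
quietly regress incwage US_born_proportion

* Residuals straight from predict, stored in double precision.
predict double m1_resid_direct, residuals

* Gap between the by-hand residuals and the predict residuals.
* It should be tiny, reflecting only float versus double rounding.
generate double m1_resid_gap = newer_M1_residuals - m1_resid_direct
summarize m1_resid_gap
```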

 

. summarize  newer_M1_residuals

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

newer_M1_r~s |        51   -.0000266    2632.062  -4678.628   4750.814

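* With a constant in the model, the residuals have a mean of zero by construction. The tiny nonzero value here is rounding error, because predict stores the fitted values as float by default.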

 

. gen abs_newer_M1_residual=abs( newer_M1_residuals)

* To find the states with the largest residuals in absolute value, we first create a new variable holding the absolute value of the residuals.

 

. gsort  -abs_newer_M1_residual

* Then we sort the dataset by this new absolute-value variable, from largest to smallest, and list the first few observations.

 

. list incwage   US_born_proportion statefip newer_M1_residuals if _n<10

 

     +-----------------------------------------------------------+

     |     incwage   US_bor~n               statefip   newer_M~s |

     |-----------------------------------------------------------|

  1. | 24803.26155     .89299            Connecticut    4750.814 |

  2. | 17874.36641    .789128                Florida   -4678.628 |

  3. | 24575.59697    .898431               Maryland    4654.167 |

  4. | 23241.05063    .938505                 Alaska    4284.434 |

  5. | 24481.98743    .879239   District of Columbia    4098.485 |

     |-----------------------------------------------------------|

  6. | 13746.25247    .986109                Montana   -4064.271 |

  7. |    13760.09    .988432          West Virginia   -3994.506 |

  8. | 22967.69873    .935135              Minnesota    3929.941 |

  9. | 15344.73038    .933279             New Mexico   -3737.723 |

     +-----------------------------------------------------------+

* _n is Stata's built-in variable that holds the current position of each observation, from 1 to (in this case) 51, so if _n<10 keeps the first 9 observations. When you sort, the order of the observations changes, and _n changes with it. Note that Connecticut, which lies above the fitted line near its middle, has the largest residual.
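* The same listing can be requested with an in range instead of a condition on _n. This is only a sketch of an equivalent form, and it assumes the data are still sorted by abs_newer_M1_residual from the gsort above.

```stata
* in 1/9 selects observations 1 through 9 in the current sort order,
* which is the same set that if _n<10 selects.
list incwage US_born_proportion statefip newer_M1_residuals in 1/9

* _N (capital N) holds the total number of observations in memory.
display _N
```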

 

. twoway (scatter incwage  US_born_proportion, mlabel(statefip)) (lfit incwage   US_born_proportion)

* Our scatter plot again, with best fit line superimposed.

 

 

. predict newer_M1_dfbeta, dfbeta( US_born_proportion)

* This predict command again works off the regress command we ran earlier, and now generates a dfbeta for each observation. Think of an observation's dfbeta as how much the slope of our regression line changes when that observation is included rather than left out, measured in standard errors of the slope.

 

. gen newer_abs_M1_dfbeta=abs( newer_M1_dfbeta)

* And now we take the absolute value

 

. gsort - newer_abs_M1_dfbeta

* And then we sort from largest to smallest, and then we list the largest (i.e. the first few) observations.

 

. list incwage   US_born_proportion statefip newer_M1_residuals  newer_M1_dfbeta if _n<10

 

     +----------------------------------------------------------------+

     |     incwage   US_bor~n        statefip   newer_M~s   newer_M~a |

     |----------------------------------------------------------------|

  1. | 20573.98456    .730234      California   -3396.939    .7651464 |

  2. | 17874.36641    .789128         Florida   -4678.628    .6546844 |

  3. |  20716.6877    .777071        New York   -2126.592    .3214945 |

  4. | 24990.40441    .829979      New Jersey    3420.941   -.3097655 |

  5. |    13760.09    .988432   West Virginia   -3994.506   -.2389347 |

     |----------------------------------------------------------------|

  6. | 13746.25247    .986109         Montana   -4064.271   -.2340724 |

  7. | 19547.63379    .818878          Hawaii   -2289.112    .2323307 |

  8. |  17986.0649    .856372         Arizona   -2947.986    .1866927 |

  9. | 23697.95964    .849691   Massachusetts    2603.065    -.181481 |

     +----------------------------------------------------------------+

* California has the largest dfbeta because it is the state with the highest percentage foreign born (the lowest US_born_proportion). That makes California an outlier in X, which makes California very influential over the slope.
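* One way to see the "outlier in X" point is to look at leverage, which in a regression with one predictor depends only on how far each state's X value is from the mean of X. The sketch below assumes the dataset is still in memory; m1_leverage is a hypothetical variable name. One would expect California to appear at or near the top of the list, but run it to check.

```stata
* Refit so that predict and lvr2plot work off this regression.
quietly regress incwage US_born_proportion

* Leverage (diagonal of the hat matrix) for each observation.
predict m1_leverage, leverage

* Sort from highest to lowest leverage and list the top few states.
* Note that this changes the sort order of the data in memory.
gsort -m1_leverage
list statefip US_born_proportion m1_leverage in 1/5

* Leverage versus squared residual plot, labeled by state.
lvr2plot, mlabel(statefip)
```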

 

* Now we will run the regression with all states, and then without California.

 

. *CA is statefip==6

 

 

. regress incwage   US_born_proportion

 

      Source |       SS       df       MS              Number of obs =      51

-------------+------------------------------           F(  1,    49) =   14.84

       Model |   104897990     1   104897990           Prob > F      =  0.0003

    Residual |   346387525    49  7069133.17           R-squared     =  0.2324

-------------+------------------------------           Adj R-squared =  0.2168

       Total |   451285515    50   9025710.3           Root MSE      =  2658.8

 

------------------------------------------------------------------------------

     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

US_born_pr~n |  -24075.79       6250    -3.85   0.000    -36635.63   -11515.94

       _cons |   41551.87   5790.425     7.18   0.000     29915.58    53188.17

------------------------------------------------------------------------------

 

 

. display -24075-((.7651)*6250)

-28856.875

 

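* California's dfbeta of .7651 means that including California pulls the slope up (toward zero) by about 0.7651 standard errors. Without California, the slope on US born proportion should therefore be steeper (more negative) by roughly 0.7651 times the std error of 6250, which gives about -28,857 as displayed above. The regression below, which drops California, gives -28,802. The small gap arises because Stata scales dfbeta by a standard error based on the Root MSE of the regression without the observation, not by the 6250 from the full regression.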

 

. regress incwage   US_born_proportion if statefip!=6

 

      Source |       SS       df       MS              Number of obs =      50

-------------+------------------------------           F(  1,    48) =   17.11

       Model |   118175271     1   118175271           Prob > F      =  0.0001

    Residual |   331435409    48  6904904.35           R-squared     =  0.2628

-------------+------------------------------           Adj R-squared =  0.2475

       Total |   449610680    49  9175728.16           Root MSE      =  2627.7

 

------------------------------------------------------------------------------

     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

US_born_pr~n |  -28802.08   6962.086    -4.14   0.000    -42800.29   -14803.87

       _cons |   46007.88   6474.534     7.11   0.000     32989.95     59025.8

------------------------------------------------------------------------------
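* The two regression tables above can be used to rebuild California's dfbeta by hand. This is a rough check typed in from the rounded numbers in this log, under the assumption that dfbeta is the change in the slope divided by the slope's standard error rescaled to the Root MSE of the regression without California (2627.7 instead of 2658.8).

```stata
* Change in the slope from including California (full minus without-CA).
display -24075.79 - (-28802.08)

* Full-sample std error of 6250, rescaled to the without-CA Root MSE.
display 6250 * 2627.7 / 2658.8

* Ratio of the two; it should come out close to the dfbeta of .7651.
display (-24075.79 - (-28802.08)) / (6250 * 2627.7 / 2658.8)
```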

 

. twoway (scatter incwage  US_born_proportion, mlabel(statefip)) (lfit incwage   US_born_proportion)

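* Same graph again. Because this twoway command has no if condition, California is still plotted and the fit line is still the one from all 51 observations.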
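* To see the effect of California on the line itself, both fit lines can be drawn on one graph. This is a sketch rather than something we ran in class; the legend labels are my own wording, and the /// line continuations only work in a do-file (type it all on one line in the command window).

```stata
* Scatter of all states, plus one fit line from all 51 observations
* and a second fit line estimated without California (statefip 6).
twoway (scatter incwage US_born_proportion, mlabel(statefip)) ///
       (lfit incwage US_born_proportion) ///
       (lfit incwage US_born_proportion if statefip!=6), ///
       legend(order(2 "All 51 observations" 3 "Without California"))
```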

 

. log close

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\s

> oc_meth_proj3\2011_180B_logs\class10.log

  log type:  text

 closed on:  24 Feb 2011, 15:19:33

---------------------------------------------------------------------------------------