--------------------------------------------------------------------------------
      name:  <unnamed>
  log type:  text
 opened on:  19 Feb 2013, 13:36:58

* We start with Anscombe’s data. Copy it from Excel and paste it into the Stata Data Editor (the spreadsheet-like icon on the Stata toolbar), then save it as a Stata file. I have already saved the Anscombe data as a Stata file, so:
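* If you would rather not copy and paste, the sketch below types one Anscombe pair straight into Stata with input. The values are the x2 and y2 values that appear in the list output later in this log; the file name anscombe_x2y2.dta is a hypothetical placeholder, not the file used in class.

```stata
* Sketch: enter x2 and y2 by hand, then save them as a Stata file.
clear
input x2 y2
 4 3.10
14 8.10
13 8.74
 5 4.74
 7 7.26
11 9.26
 8 8.14
10 9.14
 6 6.13
12 9.13
 9 8.77
end
* Hypothetical file name; the file is written to the current working directory.
save "anscombe_x2y2.dta", replace
```

* After this, regress y2 x2 can be run on the data in memory exactly as in the log.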

. use "C:\Users\Michael\Documents\current class files\intro soc methods\anscombe.dta", clear

. regress y2 x2

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.5000024     1  27.5000024           Prob > F      =  0.0022
    Residual |   13.776294     9  1.53069933           R-squared     =  0.6662
       Total |  41.2762964    10  4.12762964           Root MSE      =  1.2372

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |         .5   .1179638     4.24   0.002     .2331475    .7668526
       _cons |   3.000909   1.125303     2.67   0.026     .4552978     5.54652
------------------------------------------------------------------------------

. twoway (scatter y2 x2) (lfit y2 x2)

* To really follow what went on in class, you will need to generate the plots yourself, because a text log like this one does not record graphs.

. predict M2_dfbeta, dfbeta(x2)

* predict is a post-estimation command: it uses the results of the most recent estimation command, here the regression we ran above.
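* To illustrate what else predict can produce after regress, here is a minimal sketch on the auto dataset that ships with Stata. The auto data are a stand-in, not the class data, and the new variable names are made up for the example.

```stata
* Sketch: several predict options after a single regress.
sysuse auto, clear
regress price weight
* Fitted values (xb is the default when no option is given).
predict yhat, xb
* Raw residuals: observed price minus fitted price.
predict e, residuals
* Studentized residuals.
predict rstu, rstudent
* Leverage (diagonal of the hat matrix).
predict lev, leverage
* dfbeta for the weight coefficient.
predict dfb, dfbeta(weight)
summarize yhat e rstu lev dfb
```

* Each predict line adds one new variable to the data in memory; none of them re-runs the regression.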

. twoway (scatter y2 x2) (scatter  M2_dfbeta x2) (lfit y2 x2)

* The command above shows the scatter plot, the best-fit line, and a plot of the dfbetas. The dfbetas are largest in absolute value at the extremes of the X distribution, because X outliers have the most influence over the slope. A dfbeta measures how much a single point changes the slope: how different the slope would be if that point were missing, scaled by the slope's standard error.
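* The "what if each point were missing" idea can be checked by brute force. The sketch below, again on the auto stand-in data, drops one observation at a time, re-runs the regression, and scales the change in slope by a standard error based on the regression without that point. The scalar and variable names are assumptions for the example. If the scaling matches what predict uses, the manual column should agree with the built-in one up to rounding.

```stata
* Sketch: dfbeta for the slope, computed by leaving out one observation at a time.
sysuse auto, clear
* Sum of squared deviations of X around its mean, from the full sample.
quietly summarize weight
scalar sxx = r(Var) * (r(N) - 1)
* Full-sample slope and the built-in dfbeta.
quietly regress price weight
scalar b_full = _b[weight]
predict dfb_builtin, dfbeta(weight)
* Leave-one-out loop.
gen double dfb_manual = .
local N = _N
forvalues i = 1/`N' {
    quietly regress price weight if _n != `i'
    * Change in slope, divided by (leave-one-out root MSE / sqrt(sxx)).
    quietly replace dfb_manual = (scalar(b_full) - _b[weight]) / (e(rmse) / sqrt(scalar(sxx))) in `i'
}
* Compare the two versions.
gen double diff = abs(dfb_builtin - dfb_manual)
summarize diff
list make weight dfb_builtin dfb_manual in 1/5
```

* The loop is slow compared with predict, but it makes the definition concrete.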

. gen abs_m2_dfbeta=abs( M2_dfbeta)

. gsort - abs_m2_dfbeta

. list  abs_m2_dfbeta M2_dfbeta x2 y2

     +----------------------------------+
     | abs_m2~a   M2_dfbeta   x2     y2 |
     |----------------------------------|
  1. | 1.291224    1.291224    4    3.1 |
  2. | 1.291224   -1.291224   14    8.1 |
  3. | .2979074   -.2979074   13   8.74 |
  4. | .2979073    .2979073    5   4.74 |
  5. | .1295366   -.1295366    7   7.26 |
     |----------------------------------|
  6. | .1295366    .1295366   11   9.26 |
  7. | .0971856   -.0971856    8   8.14 |
  8. | .0971856    .0971856   10   9.14 |
  9. | .0340383   -.0340383    6   6.13 |
 10. | .0340383    .0340383   12   9.13 |
     |----------------------------------|
 11. |        0           0    9   8.77 |
     +----------------------------------+

* Above, we generate the absolute value of the dfbetas, sort the observations from largest to smallest on that new variable, and list all points in that order.
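* One common rule of thumb (Belsley, Kuh, and Welsch) treats an observation as worth a closer look when its absolute dfbeta exceeds 2/sqrt(n). For the 11 Anscombe points that cutoff is about 0.60, so only the first two rows of the list above exceed it. Here is a sketch of the same check on the auto stand-in data; the variable names are made up for the example.

```stata
* Sketch: list only the observations whose |dfbeta| exceeds 2/sqrt(n).
sysuse auto, clear
regress price weight
predict dfb, dfbeta(weight)
gen abs_dfb = abs(dfb)
gsort -abs_dfb
* e(N) is the number of observations used by the last regress.
list make price weight dfb if abs_dfb > 2/sqrt(e(N))
```

* The cutoff is a screening device, not a test; points above it deserve a look, nothing more.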

. clear all

* OK, now on to the 50-state dataset, which is posted on my website in Stata format. It has 51 observations because it includes the District of Columbia.

. use "C:\Users\Michael\Documents\current class files\intro soc methods\fifty_state_dataset.dta", clear

. describe

Contains data from C:\Users\Michael\Documents\current class files\intro soc methods\fifty_state_dataset.dta
  obs:            51
 vars:            11                          7 Nov 2010 14:16
 size:         2,703 (99.9% of memory free)
-------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------------
statefip        byte   %57.0g      statefiplbl
                                              State (FIPS code)
US_born_propo~n float  %8.0g                  mean(US_b~n)
seniors_propo~n float  %8.0g                  mean(ove~65)
children_prop~n float  %8.0g                  mean(chil~n)
NH_White_prop~n float  %8.0g                  mean(NH_w~e)
inctot          double %12.0g                 mean(inctot)
CPS_population  long   %9.0gc                 Freq.
incwage         double %12.0g                 mean(incw~e)
male_proportion float  %8.0g                  mean(male)
urban_proport~n float  %8.0g                  mean(urban)
yrsed           float  %9.0g                  mean(yrsed)
-------------------------------------------------------------------------------------
Sorted by:

. twoway (scatter incwage   US_born_proportion, mlabel(statefip)) (lfit incwage   US_born_proportion)

* This scatter plot is much like the one you will have to produce yourself in HW4.

. regress incwage  US_born_proportion

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   14.84
       Model |   104897990     1   104897990           Prob > F      =  0.0003
    Residual |   346387525    49  7069133.17           R-squared     =  0.2324
       Total |   451285515    50   9025710.3           Root MSE      =  2658.8

------------------------------------------------------------------------------
     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
US_born_pr~n |  -24075.79       6250    -3.85   0.000    -36635.63   -11515.94
       _cons |   41551.87   5790.425     7.18   0.000     29915.58    53188.17
------------------------------------------------------------------------------

. predict m1_50st_predicted

(option xb assumed; fitted values)

. gen m1_residuals=incwage- m1_50st_predicted

. gen abs_resid=abs( m1_residuals)

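* We generate the residuals (observed minus predicted) and then a new variable holding their absolute value, so that later we can see which states have the largest residuals.

* And by the way: what is the average of all residuals? That’s right, it has to be zero. The mean of -.0000266 in the summarize output below differs from zero only because of rounding in the stored values.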

. summarize  m1_residuals

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
m1_residuals |        51   -.0000266    2632.062  -4678.628   4750.814
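* Residuals do not have to be built by hand: predict has a residuals option. The sketch below, on the auto stand-in data, computes them both ways and stores everything as double, which should shrink the rounding error in the mean. The variable names are made up for the example.

```stata
* Sketch: residuals by subtraction and with predict's residuals option.
sysuse auto, clear
regress price weight
* Fitted values stored in double precision.
predict double yhat, xb
* Residuals by hand: observed minus fitted.
gen double e_manual = price - yhat
* Residuals straight from predict.
predict double e_builtin, residuals
* Both means should be zero up to rounding error.
summarize e_manual e_builtin
```

* The two residual variables should be identical apart from rounding, so either approach is fine for the homework.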

. predict m1_dfbeta,dfbeta( US_born_proportion)

. gen abs_dfbeta=abs( m1_dfbeta)

. gsort - abs_dfbeta

* We generate the absolute value of the dfbeta variable, sort from largest to smallest, and list the first 9 observations (the 9 largest abs_dfbetas):

. list statefip incwage  US_born_proportion abs_dfbeta m1_dfbeta m1_residuals if _n<10

     +---------------------------------------------------------------------------+
     |      statefip       incwage   US_bor~n   abs_df~a   m1_dfbeta   m1_resi~s |
     |---------------------------------------------------------------------------|
  1. |    California   20573.98456    .730234   .7651464    .7651464   -3396.939 |
  2. |       Florida   17874.36641    .789128   .6546844    .6546844   -4678.628 |
  3. |      New York    20716.6877    .777071   .3214945    .3214945   -2126.592 |
  4. |    New Jersey   24990.40441    .829979   .3097655   -.3097655    3420.941 |
  5. | West Virginia      13760.09    .988432   .2389347   -.2389347   -3994.506 |
     |---------------------------------------------------------------------------|
  6. |       Montana   13746.25247    .986109   .2340724   -.2340724   -4064.271 |
  7. |        Hawaii   19547.63379    .818878   .2323307    .2323307   -2289.112 |
  8. |       Arizona    17986.0649    .856372   .1866927    .1866927   -2947.986 |
  9. | Massachusetts   23697.95964    .849691    .181481    -.181481    2603.065 |
     +---------------------------------------------------------------------------+

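* California, Florida, and New York have the largest abs_dfbetas, but a large dfbeta does not require a large residual: Florida has the second-largest residual, while California and New York are not among the nine largest; see below. _n is Stata's built-in variable holding the current observation number, so "if _n<10" keeps the first 9 observations in the current sort order.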
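* Here is a short sketch of how _n behaves, on the auto stand-in data. The variable name obs_no is made up for the example. Because _n follows the current sort order, it changes whenever the data are re-sorted.

```stata
* Sketch: _n is the observation number in the current sort order.
sysuse auto, clear
* Sort from most to least expensive.
gsort -price
* Freeze the current observation numbers in a variable.
gen obs_no = _n
* Two equivalent ways to list the first 9 observations.
list obs_no make price if _n < 10
list obs_no make price in 1/9
```

* The "in 1/9" form is usually the shorter way to ask for the top of a sorted list.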

. gsort - abs_resid

. list statefip incwage  US_born_proportion abs_dfbeta m1_dfbeta m1_residuals if _n<10

     +----------------------------------------------------------------------------------+
     |             statefip       incwage   US_bor~n   abs_df~a   m1_dfbeta   m1_resi~s |
     |----------------------------------------------------------------------------------|
  1. |          Connecticut   24803.26155     .89299   .1393247   -.1393247    4750.814 |
  2. |              Florida   17874.36641    .789128   .6546844    .6546844   -4678.628 |
  3. |             Maryland   24575.59697    .898431   .1125866   -.1125866    4654.167 |
  4. |               Alaska   23241.05063    .938505   .0549285    .0549285    4284.434 |
  5. | District of Columbia   24481.98743    .879239   .1720585   -.1720585    4098.485 |
     |----------------------------------------------------------------------------------|
  6. |              Montana   13746.25247    .986109   .2340724   -.2340724   -4064.271 |
  7. |        West Virginia      13760.09    .988432   .2389347   -.2389347   -3994.506 |
  8. |            Minnesota   22967.69873    .935135    .038025     .038025    3929.941 |
  9. |           New Mexico   15344.73038    .933279   .0297481   -.0297481   -3737.723 |
     +----------------------------------------------------------------------------------+
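* Why can a state have a large residual but a small dfbeta (Alaska, Minnesota), or a large dfbeta without an especially large residual (California)? Influence on the slope depends on leverage, meaning how far the X value lies from the mean of X, as well as on the residual. Here is a sketch on the auto stand-in data that puts leverage, residual, and dfbeta side by side; the variable names are made up for the example.

```stata
* Sketch: compare leverage, residuals, and dfbeta for the most influential points.
sysuse auto, clear
regress price weight
* Leverage: large when weight is far from its mean.
predict lev, leverage
predict res, residuals
predict dfb, dfbeta(weight)
gen abs_dfb = abs(dfb)
gsort -abs_dfb
list make weight lev res dfb in 1/5
* Leverage versus squared normalized residual, with points labeled by make.
lvr2plot, mlabel(make)
```

* Points in the upper right of the lvr2plot graph combine high leverage with a large residual, and those are the ones most likely to move the slope.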

. log close

name:  <unnamed>