--------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_pro
> j3\2011_180B_logs\class10.log
log type: text
opened on: 24 Feb 2011, 13:26:09
* First we load the 50 state dataset.
. use "C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fifty_state_dataset.dta", clear
. describe
* Note that the data has 51 observations, 50 states plus DC.
Contains data from C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fifty_state_dataset.dta
obs: 51
vars: 11 7 Nov 2010 14:16
size: 2,703 (99.9% of memory free)
---------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------------------------------
statefip byte %57.0g statefiplbl
State (FIPS code)
US_born_propo~n float %8.0g mean(US_b~n)
seniors_propo~n float %8.0g mean(ove~65)
children_prop~n float %8.0g mean(chil~n)
NH_White_prop~n float %8.0g mean(NH_w~e)
inctot double %12.0g mean(inctot)
CPS_population long %9.0gc Freq.
incwage double %12.0g mean(incw~e)
male_proportion float %8.0g mean(male)
urban_proport~n float %8.0g mean(urban)
yrsed float %9.0g mean(yrsed)
---------------------------------------------------------------------------------------
Sorted by:
. twoway (scatter incwage US_born_proportion, mlabel(statefip)) (lfit incwage US_born_proportion)
* If you weren't in class you really must plot these scatter plots and look at the line to understand which states are outliers and why.
. regress incwage US_born_proportion
Source | SS df MS Number of obs = 51
-------------+------------------------------ F( 1, 49) = 14.84
Model | 104897990 1 104897990 Prob > F = 0.0003
Residual | 346387525 49 7069133.17 R-squared = 0.2324
-------------+------------------------------ Adj R-squared = 0.2168
Total | 451285515 50 9025710.3 Root MSE = 2658.8
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
US_born_pr~n | -24075.79 6250 -3.85 0.000 -36635.63 -11515.94
_cons | 41551.87 5790.425 7.18 0.000 29915.58 53188.17
------------------------------------------------------------------------------
* The regress command produces the line that lfit plotted in the scatter plot above.
. predict newer_M1_predicted
(option xb assumed; fitted values)
* After regression we can generate predicted values.
. gen newer_M1_residuals=incwage- newer_M1_predicted
* Stata can generate the residuals directly (predict with the residuals option), but in this case I wanted to show that the residuals can easily be obtained as actual minus predicted values.
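* A sketch that was not run in this session: asking predict for the residuals directly. The variable name M1_resid_direct is hypothetical, and the sketch assumes the regress command above is still the most recent estimation command.

```stata
* Residuals straight from predict (hypothetical variable name)
predict M1_resid_direct, residuals
* Compare with the hand-computed residuals; the two should agree up to rounding
summarize newer_M1_residuals M1_resid_direct
```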
. summarize newer_M1_residuals
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
newer_M1_r~s | 51 -.0000266 2632.062 -4678.628 4750.814
* The mean of the residuals is zero in theory whenever the regression includes a constant; here it is -.0000266, which differs from zero only because of rounding.
. gen abs_newer_M1_residual=abs( newer_M1_residuals)
* If we want to know which states have the largest residuals in absolute value, we first have to create a new variable holding the absolute value of the residuals.
. gsort -abs_newer_M1_residual
* Sort the dataset by this new absolute value of residual variable, from largest to smallest, and then list the first few observations.
. list incwage US_born_proportion statefip newer_M1_residuals if _n<10
+-----------------------------------------------------------+
| incwage US_bor~n statefip newer_M~s |
|-----------------------------------------------------------|
1. | 24803.26155 .89299 Connecticut 4750.814 |
2. | 17874.36641 .789128 Florida -4678.628 |
3. | 24575.59697 .898431 Maryland 4654.167 |
4. | 23241.05063 .938505 Alaska 4284.434 |
5. | 24481.98743 .879239 District of Columbia 4098.485 |
|-----------------------------------------------------------|
6. | 13746.25247 .986109 Montana -4064.271 |
7. | 13760.09 .988432 West Virginia -3994.506 |
8. | 22967.69873 .935135 Minnesota 3929.941 |
9. | 15344.73038 .933279 New Mexico -3737.723 |
+-----------------------------------------------------------+
* The _n is Stata's built-in variable that holds the number of each observation, from 1 to (in this case) 51, so "if _n<10" lists the first 9 observations. When you sort, the order of the observations changes. Note that Connecticut, which sits above the fitted line near the middle of the plot, has the largest residual.
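* A sketch that was not run in this session: the same first nine rows can be requested with an "in" range instead of a condition on _n. It assumes the data are still sorted by abs_newer_M1_residual as above.

```stata
* Equivalent to "if _n<10": observations 1 through 9 in the current sort order
list incwage US_born_proportion statefip newer_M1_residuals in 1/9
```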
. twoway (scatter incwage US_born_proportion, mlabel(statefip)) (lfit incwage US_born_proportion)
* Our scatter plot again, with best fit line superimposed.
. predict newer_M1_dfbeta, dfbeta( US_born_proportion)
* This predict command again works off the regress command we ran earlier, and now generates a dfbeta for each observation. Think of an observation's dfbeta as how much the slope of our regression line changes, measured in standard errors of the slope, when that observation is included rather than left out.
. gen newer_abs_M1_dfbeta=abs( newer_M1_dfbeta)
* And now we take the absolute value
. gsort - newer_abs_M1_dfbeta
* And then we sort from largest to smallest, and then we list the largest (i.e. the first few) observations.
. list incwage US_born_proportion statefip newer_M1_residuals newer_M1_dfbeta if _n<10
+----------------------------------------------------------------+
| incwage US_bor~n statefip newer_M~s newer_M~a |
|----------------------------------------------------------------|
1. | 20573.98456 .730234 California -3396.939 .7651464 |
2. | 17874.36641 .789128 Florida -4678.628 .6546844 |
3. | 20716.6877 .777071 New York -2126.592 .3214945 |
4. | 24990.40441 .829979 New Jersey 3420.941 -.3097655 |
5. | 13760.09 .988432 West Virginia -3994.506 -.2389347 |
|----------------------------------------------------------------|
6. | 13746.25247 .986109 Montana -4064.271 -.2340724 |
7. | 19547.63379 .818878 Hawaii -2289.112 .2323307 |
8. | 17986.0649 .856372 Arizona -2947.986 .1866927 |
9. | 23697.95964 .849691 Massachusetts 2603.065 -.181481 |
+----------------------------------------------------------------+
* California has the largest dfbeta because it is the state with the highest percentage foreign born (the lowest US_born_proportion). That makes California an outlier in X, and therefore very influential over the slope.
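* A sketch that was not run in this session: a common rule of thumb (not something from this class) flags an observation as influential when the absolute value of its dfbeta exceeds 2/sqrt(n). With n=51 the cutoff is about 0.28, so in the listing above California, Florida, New York and New Jersey would be flagged.

```stata
* Rule-of-thumb cutoff for |dfbeta|; the 2/sqrt(n) threshold is an assumption of this sketch
display 2/sqrt(51)
* List the states whose |dfbeta| exceeds the cutoff (_N is the number of observations)
list statefip newer_M1_dfbeta if newer_abs_M1_dfbeta > 2/sqrt(_N) & !missing(newer_abs_M1_dfbeta)
```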
* Now we will run the regression with all states, and then without California.
. *CA is statefip==6
. regress incwage US_born_proportion
Source | SS df MS Number of obs = 51
-------------+------------------------------ F( 1, 49) = 14.84
Model | 104897990 1 104897990 Prob > F = 0.0003
Residual | 346387525 49 7069133.17 R-squared = 0.2324
-------------+------------------------------ Adj R-squared = 0.2168
Total | 451285515 50 9025710.3 Root MSE = 2658.8
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
US_born_pr~n | -24075.79 6250 -3.85 0.000 -36635.63 -11515.94
_cons | 41551.87 5790.425 7.18 0.000 29915.58 53188.17
------------------------------------------------------------------------------
. display -24075-((.7651)*6250)
-28856.875
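* California's dfbeta of .7651 means that without California, the slope on US born proportion would be roughly 0.77 standard errors (0.7651 times the std error of 6250) more negative, i.e. steeper: about -28857. This is only an approximation, because Stata scales the dfbeta by a standard error computed with California left out. The regression below gives the exact slope without California.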
. regress incwage US_born_proportion if statefip!=6
Source | SS df MS Number of obs = 50
-------------+------------------------------ F( 1, 48) = 17.11
Model | 118175271 1 118175271 Prob > F = 0.0001
Residual | 331435409 48 6904904.35 R-squared = 0.2628
-------------+------------------------------ Adj R-squared = 0.2475
Total | 449610680 49 9175728.16 Root MSE = 2627.7
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
US_born_pr~n | -28802.08 6962.086 -4.14 0.000 -42800.29 -14803.87
_cons | 46007.88 6474.534 7.11 0.000 32989.95 59025.8
------------------------------------------------------------------------------
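* A sketch that was not run in this session: rebuilding California's dfbeta by hand from the two regressions above. The dfbeta is the change in the slope (all states minus without California) divided by a standard error that uses the Root MSE from the regression without California. All numbers are copied from the output above, so rounding in the printed values carries through.

```stata
* (slope, all states - slope, no CA) / (Root MSE, no CA * std err, all states / Root MSE, all states)
display (-24075.79 - (-28802.08)) / (2627.7 * 6250 / 2658.8)
```

* The result should come out close to the .7651 that predict reported for California.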
. twoway (scatter incwage US_born_proportion, mlabel(statefip)) (lfit incwage US_born_proportion)
* The same graph again (all 51 observations, California included).
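* A sketch that was not run in this session: overlaying a second fitted line that leaves out California (statefip==6), to see how much the slope shifts. The legend labels are my own wording.

```stata
* Scatter plot with two fitted lines: all states, and all states except California
twoway (scatter incwage US_born_proportion, mlabel(statefip)) ///
       (lfit incwage US_born_proportion) ///
       (lfit incwage US_born_proportion if statefip!=6), ///
       legend(order(2 "All states" 3 "Without California"))
```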
. log close
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\s
> oc_meth_proj3\2011_180B_logs\class10.log
log type: text
closed on: 24 Feb 2011, 15:19:33
---------------------------------------------------------------------------------------