HW 4:
Revised Nov 3, 2011
1) Start with Anscombe’s data (available on my website). The data are in Excel format; you can copy them into the Stata data editor. In Stata and in Excel, plot Y1 vs X1, then Y2 vs X2, and so on, and superimpose the regression line of Y on X (i.e. regress y x) in each scatterplot. In your homework include the Stata versions of 2 graphs, and the Excel versions of the other two. [Note: doing regression lines in Excel requires an add-in module for statistics that you may not have. If you do all the figures for this part of the homework in Stata, that is fine too.]
a) What do you notice about the regression line in all 4 cases? How different are the datasets? Comment on how informative the regression lines are.
b) Indicate on your graphs the point with the largest absolute value residual from the linear regression (retrieve the residuals after regression by predict varname, residual).
c) There are several measures of influence, that is, of which point has the most influence over the slope of the regression line. One measure of influence, DFbeta, calculates alternative regression slopes by dropping one point at a time, which shows which point’s absence would change the slope of the line the most. You can retrieve the DFbetas with the command dfbeta after your regression. Indicate the point on each graph with the largest absolute value DFbeta. Why do you think Stata cannot calculate the DFbetas for the model regress y4 x4?
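The Stata steps for parts (b) and (c) can be sketched roughly as follows for one of the four pairs. This is a minimal sketch, not the required solution: the variable names x1 and y1 are assumed from the commands quoted above, and res1 and absres1 are made-up names for new variables.

```stata
* Sketch for one Anscombe pair; repeat for (y2, x2), (y3, x3), (y4, x4).
* Variable names x1 and y1 are assumed; check them with the describe command.
regress y1 x1

* Residuals from the regression just run (res1 and absres1 are made-up names)
predict res1, residuals
generate absres1 = abs(res1)

* DFbetas for every predictor in the model (here only x1)
dfbeta

* Scatterplot with the fitted line of y1 on x1 superimposed
twoway (scatter y1 x1) (lfit y1 x1)

* Show the observation with the largest absolute residual
gsort -absres1
list x1 y1 res1 in 1
```

The same sort-and-list step can be applied to the absolute value of whatever variable dfbeta creates; its name differs across Stata versions, so check the note that dfbeta prints.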
2) Use the 50-state dataset (available on my website, Stata format), which is derived from the very familiar March 2000 CPS dataset.
a) Is there a linear relationship between pct US_born and total income?
b) How would you describe the shape of the relationship between inctot and pct US born?
c) Generate the regression line of inctot on US_born (i.e. regress inctot US_born), and add that line to the graph. Do the regressions separately with and without fweight=
d) Generate the residuals and the DFbetas for the unweighted regression above. Which state has the highest absolute value residual? Which state has the highest absolute value DFbeta? Why do you think the state with the largest (absolute value) residual and the state with the largest (absolute value) DFbeta are different?
e) Make a regression table, where regress inctot US_born [fweight=

Model 1           | Model 2                | Model 3
------------------+------------------------+------------------------
US_Born pct       | US_Born pct            | US_Born pct
                  | Your first control var | Your first control var
                  |                        | Your second control var
Constant          | Constant               | Constant
Unweighted N      | Unweighted N           | Unweighted N
Adjusted R-square | Adjusted R-square      | Adjusted R-square
f) This dataset has only 51 observations (50 states plus DC), so it is obviously a very reduced dataset from our original March 2000
g) [New] Now go back to our regular individual level March, 2000 CPS dataset, and make a simple table of the average of inctot for US born adults and immigrant adults. Then run a simple OLS regression on the same age group, with dummy variable US_born as the sole predictor of inctot. Who has more income? Why are these results different from Model 1 in part 2e above?
3) There is no question 3.
4) Question 4 should be done by hand, though you can check your work in Excel or in Stata if you need the confidence boost. On the final you will have to do something like this by hand:
We select 4 numbers at random from a large set. The 4 numbers are 21, 21, 29, 29.
a) What is the Average of the 4 numbers?
b) What is the Variance of the 4 numbers? (use 1/N rather than 1/(N-1) in the Variance formula if you want the numbers to work out most easily)
c) What is the Standard Deviation of the 4 numbers?
d) What is the Standard Error of the Mean?
e) How sure are we that the average (of the large group of numbers these 4 are picked from) is greater than 21? Consult Freedman’s T-statistic table for the answer. What T-statistic value and how many degrees of freedom would this test correspond to? Is this a one or a two tailed test?
f) How sure are we that the average of these 4 numbers is greater than 21?
g) Let’s say we want to select more numbers from the large group in order to drive down the Standard Error of the Mean. How many numbers (approximately) would we have to draw from our large set in order to be 95% sure that the average of the whole set was within 1 point of the average that we measured? You can assume that the mean and the variance of our sample won’t change as we gather more measurements.
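For reference, the quantities named in parts (a) through (e) are related as sketched below. The notation is introduced here and is not from the assignment: N is the number of draws, the x_i are the drawn values, and mu_0 is the hypothesized average being tested against. The variance is written with the 1/N convention mentioned in part (b); whether the SD that enters the standard error should use 1/N or 1/(N-1) is a convention to take from the course text.

```latex
\begin{align*}
\bar{x} &= \frac{1}{N}\sum_{i=1}^{N} x_i \\
\mathrm{Var} &= \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2 \\
\mathrm{SD} &= \sqrt{\mathrm{Var}} \\
\mathrm{SE}_{\bar{x}} &= \frac{\mathrm{SD}}{\sqrt{N}} \\
t &= \frac{\bar{x} - \mu_0}{\mathrm{SE}_{\bar{x}}}
\end{align*}
```

Because the standard error shrinks with the square root of N, part (g) comes down to asking how large N must be before the standard error is small enough.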
5) Interpretation of a regression table. In this table the dependent variable is being married to a black man, and the population is married white women. While this kind of dichotomous dependent variable would in theory be better served by logistic regression, here we will be using regular OLS regression for simplicity. It will probably be easiest to do this problem in Excel.
a) Regression predicting the proportion married to black men for married white women (to get a percentage, multiply results by 100) in 1940, 1960, 1970, 1980, 1990, and 2000. year2 is a continuous variable for the actual census year. fSomeColplus is a dichotomous variable for whether the married woman had at least some college education or not.
regress hus_black year2 fSomeColplus if frace==1

      Source |       SS       df       MS              Number of obs = 2446725
-------------+------------------------------           F(  2,2446722) =  838.32
       Model |  3.80217899       2   1.9010895         Prob > F      =  0.0000
    Residual |   5548.5404 2446722  .002267745         R-squared     =  0.0007
-------------+------------------------------           Adj R-squared =  0.0007
       Total |  5552.34258 2446724  .002269297         Root MSE      =  .04762
------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       year2 |   .0000586   1.75e-06    33.41   0.000     .0000552    .0000621
fSomeColplus |   .0007976   .0000695    11.48   0.000     .0006614    .0009338
       _cons |   -.113863   .0034622   -32.89   0.000    -.1206487   -.1070773
------------------------------------------------------------------------------
Generate the following table of predicted values, based only on the regression results above:
Predicted percentage married to black men:
Year | wives with at least some college | wives without college education
-----+----------------------------------+---------------------------------
1940 |                                  |
1960 |                                  |
1970 |                                  |
1980 |                                  |
1990 |                                  |
2000 |                                  |
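Each cell in the part (a) table comes from plugging a year and a college indicator into the fitted equation. A sketch of that equation follows; the symbols are notation introduced here, with the b's standing for the coefficients reported on the _cons, year2, and fSomeColplus rows of the output above, and c equal to 1 for wives with at least some college and 0 otherwise.

```latex
\begin{align*}
\hat{p}(\mathrm{year}, c) &= b_{\mathrm{cons}} + b_{\mathrm{year2}} \cdot \mathrm{year} + b_{\mathrm{fSomeColplus}} \cdot c, \qquad c \in \{0, 1\} \\
\text{predicted percentage} &= 100 \cdot \hat{p}(\mathrm{year}, c)
\end{align*}
```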
b) Generate the same table of predicted values, based only on the regression results below:
Predicted percentage married to black men (remember that the predicted values from the model are proportions, theoretically between 0 and 1). This regression replaces continuous year with a categorical year variable, excluding the first census year, 1940:
xi: regress hus_black i.year2 fSomeColplus if frace==1
i.year2           _Iyear2_1940-2000   (naturally coded; _Iyear2_1940 omitted)

      Source |       SS       df       MS              Number of obs = 2446725
-------------+------------------------------           F(  6,2446718) =  346.20
       Model |  4.70979513       6  .784965855         Prob > F      =  0.0000
    Residual |  5547.63279 2446718  .002267377         R-squared     =  0.0008
-------------+------------------------------           Adj R-squared =  0.0008
       Total |  5552.34258 2446724  .002269297         Root MSE      =  .04762
------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iyear2_1960 |  -.0004723   .0001217    -3.88   0.000    -.0007108   -.0002337
_Iyear2_1970 |  -.0001328   .0001197    -1.11   0.268    -.0003674    .0001019
_Iyear2_1980 |    .001089    .000118     9.23   0.000     .0008578    .0013202
_Iyear2_1990 |   .0016261   .0001185    13.72   0.000     .0013938    .0018583
_Iyear2_2000 |   .0030743   .0001199    25.64   0.000     .0028392    .0033093
fSomeColplus |   .0006729   .0000699     9.63   0.000     .0005359    .0008099
       _cons |   .0010347   .0000934    11.08   0.000     .0008517    .0012178
------------------------------------------------------------------------------
Predicted percentage married to black men:
Year | wives with at least some college | wives without college education
-----+----------------------------------+---------------------------------
1940 |                                  |
1960 |                                  |
1970 |                                  |
1980 |                                  |
1990 |                                  |
2000 |                                  |
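With a categorical year variable, the fitted equation for part (b) adds a separate shift for each census year instead of a slope on year. A sketch follows, again in notation introduced here: d_year is the coefficient on the matching _Iyear2 dummy in the output above, and the omitted year, 1940, has no dummy, so its shift is zero.

```latex
\begin{align*}
\hat{p}(\mathrm{year}, c) &= b_{\mathrm{cons}} + d_{\mathrm{year}} + b_{\mathrm{fSomeColplus}} \cdot c, \qquad c \in \{0, 1\} \\
d_{1940} &= 0 \quad \text{(omitted category)} \\
\text{predicted percentage} &= 100 \cdot \hat{p}(\mathrm{year}, c)
\end{align*}
```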
c) Take the predicted values from parts a and b above, and graph them in Excel. Put the college educated white women and the non-college educated white women together on the same graph, but plot the predicted values from part (a) and part (b) separately. In Excel, use XY scatter plots, with points connected by lines. Examine the two graphs. Comment on linearity (are the predicted values linear with respect to year?) and additivity (is the difference between college educated women and non-college educated women constant across years?).
Question 6 is an Extra Credit Question for Soc 180B/280B, but is required in Soc 381:
6) A brief logistic regression exercise, using our friendly old March 2000 CPS dataset.
a) Run a logistic regression predicting whether the subject is married, for subjects over age 16, using age, age_squared, and race as the predictor variables. The syntax will be as follows:
desmat: logit married @age @age_sq race if age>16
or
xi: logit married age age_sq i.race if age>16
or
logit married age age_sq i.race if age>16
Generate predicted values and summarize them. And don’t forget that the “or” option will give you the exponentiated, or odds ratio, version of the results:
logit married age age_sq i.race if age>16, or
b) Interpret the black coefficient (assuming white is the excluded category for race) in the above logistic regression. What is the 95% confidence interval for the black coefficient in odds ratio terms, and in log odds ratio terms? Does the 95% confidence interval for the odds ratio of the black coefficient include 1 (explain)?
c) Explain the 5 df Likelihood Ratio Test that Stata produces in the model for question 6(a). What two models are being compared, and what is the conclusion of this Likelihood Ratio Test (in other words, what null hypothesis is being rejected)?
d) Start with the logistic regression model from 6(a); let’s call that Model 1. For Model 2, add one term, a dummy variable for whether the respondent was born in the US. For Model 3, add dummy variables for each category of metropolitan status (using the variable metro). Comment on the Likelihood Ratio Test comparison between Models 1, 2, and 3, indicating the degrees of freedom difference between the models, expected chi-square values for each comparison, what the null hypothesis is, and whether the null hypothesis is rejected. Which model fits best by the Likelihood Ratio Test? Which model fits best by BIC?
e) Run the same regression as part 6(a) above (i.e. Model 1), but with regular OLS regression (i.e. the familiar Stata function regress) instead of logit regression. Generate the predicted values and summarize them. How are the predicted values different between the OLS and the logistic regression? Comment on the difference in the range of the predicted values between the OLS regression and logistic regression results. Graph the black and white actual and predicted marriage rates by respondent age, for OLS and for logistic models predicting marriage rate.
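The mechanics of parts (a), (d), and (e) can be sketched in Stata roughly as follows. This is only a sketch under stated assumptions: the dummy's variable name US_born is assumed to match the one used earlier in this homework, and p_logit, p_ols, m1, and m2 are made-up names.

```stata
* Model 1 from 6(a): predicted probabilities from the logit
logit married age age_sq i.race if age>16
predict p_logit if e(sample)
summarize p_logit
estimates store m1

* Model 2: adds a US-born dummy (the name US_born is an assumption)
logit married age age_sq i.race US_born if age>16
estimates store m2

* Likelihood ratio test of Model 1 against Model 2, then AIC and BIC.
* lrtest expects both models to be fit on the same observations.
lrtest m1 m2
estimates stats m1 m2

* Part (e): the same predictors with OLS, for comparing predicted values
regress married age age_sq i.race if age>16
predict p_ols if e(sample)
summarize p_ols p_logit
```

Model 3 follows the same pattern, with i.metro added to the Model 2 command and the result stored under a third name.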