HW 4:

 

Revised Oct 31, 2021

 

 

 

 

1) Start with Anscombe’s data (available on my website). The data are in Excel format; you can copy them into the Stata data editor. In Stata and in Excel, plot Y1 vs X1, then Y2 vs X2, and so on, and superimpose the regression line of Y on X (i.e. regress y1 x1) in each scatterplot. In your homework include the Stata versions of two graphs and the Excel versions of the other two. [Note: drawing regression lines in Excel requires an add-in module for statistics that you may not have. If you do all the figures for this part of the homework in Stata, that is fine too.]
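A minimal Stata sketch for the first pair (the variable names y1 and x1 are assumptions; repeat for the other three pairs):

* lfit overlays the least-squares line of y1 on x1 on top of the scatterplot
twoway (scatter y1 x1) (lfit y1 x1)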

a) What do you notice about the regression line in all 4 cases? How different are the datasets? Comment on how informative the regression lines are.

b) Indicate on your graphs the point with the largest absolute value residual from the linear regression (retrieve the residuals after regression by predict varname, residual).

c) There are several measures of influence, that is, of which point is most influential over the slope of the regression line. One measure of influence, DFbeta, calculates alternative regression slopes by dropping one point at a time, and then reports, for each point, how much its absence would change the slope of the line. You can retrieve the DFbetas with the command dfbeta after your regression. Indicate the point on each graph with the largest absolute value DFbeta. Why do you think Stata cannot calculate the DFbetas for the model regress y4 x4?
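A sketch of the commands for the first dataset (resid1 is a hypothetical variable name; repeat for the other pairs):

regress y1 x1
* store each point's residual in a new variable
predict resid1, residuals
* dfbeta creates a new variable (e.g. _dfbeta_1) holding each point's DFbeta for x1
dfbeta x1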

 

2) Use the 50-state dataset (available on my website, in Stata format), which is derived from the very familiar March, 2000 CPS data. In Stata, make a scatterplot of average inctot (on the Y axis) by pct US_born (on the X axis), and include marker labels for each state name (variable statefip).
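A sketch of the scatterplot command, assuming the variables are named inctot, US_born, and statefip (adjust to the names in the dataset):

* mlabel() prints each state's label next to its marker
scatter inctot US_born, mlabel(statefip)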

a) Is there a linear relationship between pct US_born and total income?

b) How would you describe the shape of the relationship between inctot and pct US born?

c) Generate the regression line of inctot on US_born (i.e. regress inctot US_born), and add that line to the graph. Run the regression separately with and without fweight=CPS_population, where CPS_population is the actual count of observations in the CPS. Why is the slope different between the two lines? Verify that CPS_population is indeed the count of observations in each state in the March, 2000 CPS.
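One way to sketch the unweighted and weighted fits, assuming the variable names above:

regress inctot US_born
regress inctot US_born [fweight=CPS_population]
* overlay both fitted lines on the labeled scatterplot
twoway (scatter inctot US_born, mlabel(statefip)) (lfit inctot US_born) (lfit inctot US_born [fweight=CPS_population])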

d) Generate the residuals and the DFbetas for the unweighted regression above. Which state has the highest absolute value residual? Which state has the highest absolute value DFbeta? Why do you think the state with the largest (absolute value) residual and the state with the largest (absolute value) DFbeta are different?

e) Make a regression table, where regress inctot US_born [fweight=CPS_population] is your first model. Create models 2 and 3 with additional controls from the dataset (or your transformation of variables from the dataset). Use the weights as indicated. Don’t include any income variables among the predictor variables. What do you think the relationship is between percent US born and average total income? How does it change across models? Note that the unweighted N in the table below will be 51.

 

                          Model 1      Model 2      Model 3

US_Born pct               ________     ________     ________

Your first control var                 ________     ________

Your second control var                             ________

Constant                  ________     ________     ________

Unweighted N              ________     ________     ________

Adjusted R-square         ________     ________     ________

 

 

f) This dataset has only 51 observations (50 states plus DC), so it is obviously a very reduced dataset from our original March 2000 CPS dataset, which had 133,710 observations. What do you think is gained by doing regressions on the reduced dataset of 51 state averages? What do you think is lost?

 

g) [New] Now go back to our regular individual-level March, 2000 CPS dataset, and make a simple table of the average of inctot for US-born adults and immigrant adults. Then run a simple OLS regression on the same age group, with the dummy variable US_born as the sole predictor of inctot. Who has more income? Why are these results different from Model 1 in part 2e above?
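A sketch of the table and the regression, assuming "adults" means age 18 and over and that the dummy is named US_born (adjust both as needed):

tabstat inctot if age>=18, by(US_born) statistics(mean)
* OLS with the US-born dummy as the sole predictor
regress inctot US_born if age>=18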

 

3) Mediation tests. Take the small GSS dataset posted to the class website.

a) Run a mediation test predicting approval of marriage equality (the Y variable), with gay friends in 2006 as the mediator variable and an X variable of your choice (other than education). Use the Stata command sureg, and test the mediation with nlcom (a sketch of the syntax follows part d). [new] Make a simple triangular diagram X->M->Y of the type in Rosenfeld’s mediation notes, with coefficients and asterisks (* P<0.05, ** P<0.01, *** P<0.001) indicating the significance of each coefficient, drawn or indicated for each of the 3 arrows. Do gay friends mediate between your X variable and approval of marriage equality in 2008-2010?

b) Repeat the same test as 3a but with the medeff command (which you will have to download).

c) Now add one or two additional controls from the small GSS dataset (add the control(s) to both equations) and run the mediation test again with sureg, testing the mediation with nlcom.

d) Summarize your results for the role of gay friends as a mediator between your X variable and approval of marriage equality.
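A sketch of the sureg and nlcom syntax for part (a), using hypothetical variable names marr_eq (Y), gayfriend06 (M), and xvar (your chosen X); substitute the actual names from the GSS file:

* equation 1: X -> M;  equation 2: M and X -> Y
sureg (gayfriend06 xvar) (marr_eq gayfriend06 xvar)
* the indirect (mediated) effect is the product of the X->M and M->Y coefficients
nlcom [gayfriend06]_b[xvar] * [marr_eq]_b[gayfriend06]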

 

4) Question 4 should be done by hand, though you can check your work in Excel or in Stata if you need the confidence boost (a Stata sketch for checking parts a-d follows part g).

 

We select 4 numbers at random from a large set. The 4 numbers are 21, 21, 29, and 29.

 

a) What is the Average of the 4 numbers?

b) What is the Variance of the 4 numbers? (use 1/N rather than 1/(N-1) in the Variance formula if you want the numbers to work out most easily)

c) What is the Standard Deviation of the 4 numbers?

d) What is the Standard Error of the Mean?

e) How sure are we that the average (of the large group of numbers these 4 are picked from) is greater than 21? Consult Freedman’s T-statistic table for the answer. What T-statistic value and how many degrees of freedom would this test correspond to? Is this a one-tailed or a two-tailed test?

f) How sure are we that the average of these 4 numbers is greater than 21?

g) Let’s say we want to select more numbers from the large group in order to drive down the Standard Error of the Mean. How many numbers (approximately) would we have to draw from our large set in order to be 95% sure that the average of the whole set was within 1 point of the average that we measured? You can assume that the mean and the variance of our sample won’t change as we gather more measurements.
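A quick way to check parts a-d in Stata, entering the four values by hand:

clear
input x
21
21
29
29
end
* summarize reports the mean and SD (note that Stata uses 1/(N-1), not 1/N, for the variance)
summarize x
* mean reports the mean together with its standard error
mean x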

 

 

5) [For 2021, we will be skipping ALL of Q5 in Soc 381]

Interpretation of a regression table. In this table the dependent variable is being married to a black man, and the population is married white women. While this kind of dichotomous dependent variable would in theory be better served by logistic regression, here we will be using regular OLS regression for simplicity. It will probably be easiest to do this problem in Excel.

 

a) Regression predicting the proportion married to black men among married white women (to get a percentage, multiply the results by 100) in 1940, 1960, 1970, 1980, 1990, and 2000. year2 is a continuous variable for the actual census year. fSomeColplus is a dichotomous variable for whether the married woman had at least some college education or not.

 regress hus_black year2  fSomeColplus if frace==1

 

      Source |       SS          df       MS              Number of obs =  2446725
-------------+-------------------------------             F(2, 2446722) =   838.32
       Model |  3.80217899        2   1.9010895           Prob > F      =   0.0000
    Residual |   5548.5404  2446722  .002267745           R-squared     =   0.0007
-------------+-------------------------------             Adj R-squared =   0.0007
       Total |  5552.34258  2446724  .002269297           Root MSE      =   .04762

------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       year2 |   .0000586   1.75e-06    33.41   0.000     .0000552    .0000621
fSomeColplus |   .0007976   .0000695    11.48   0.000     .0006614    .0009338
       _cons |   -.113863   .0034622   -32.89   0.000    -.1206487   -.1070773
------------------------------------------------------------------------------

 

Generate the following table of predicted values, based only on the regression results above:
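As a sketch of the arithmetic for one cell (check the rounding yourself): the predicted proportion is _cons + b_year2 × year + b_fSomeColplus × college, so for a wife with at least some college in 1940 it is -.113863 + .0000586 × 1940 + .0007976 ≈ .00062, or about 0.06 percent.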

Predicted percentage married to black men:

Year        Wives with at least some college        Wives without college education
1940
1960
1970
1980
1990
2000

 

 

b) Generate the same table of predicted values, based only on the regression results below:

Predicted percentage married to black men (remember that the predicted values from the model will be proportions, theoretically between 0 and 1). This regression replaces continuous year with a categorical year variable, with the first census year, 1940, as the omitted category:

 

xi: regress hus_black i.year2  fSomeColplus if frace==1

i.year2           _Iyear2_1940-2000   (naturally coded; _Iyear2_1940 omitted)

 

      Source |       SS          df       MS              Number of obs =  2446725
-------------+-------------------------------             F(6, 2446718) =   346.20
       Model |  4.70979513        6  .784965855           Prob > F      =   0.0000
    Residual |  5547.63279  2446718  .002267377           R-squared     =   0.0008
-------------+-------------------------------             Adj R-squared =   0.0008
       Total |  5552.34258  2446724  .002269297           Root MSE      =   .04762

------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iyear2_1960 |  -.0004723   .0001217    -3.88   0.000    -.0007108   -.0002337
_Iyear2_1970 |  -.0001328   .0001197    -1.11   0.268    -.0003674    .0001019
_Iyear2_1980 |    .001089    .000118     9.23   0.000     .0008578    .0013202
_Iyear2_1990 |   .0016261   .0001185    13.72   0.000     .0013938    .0018583
_Iyear2_2000 |   .0030743   .0001199    25.64   0.000     .0028392    .0033093
fSomeColplus |   .0006729   .0000699     9.63   0.000     .0005359    .0008099
       _cons |   .0010347   .0000934    11.08   0.000     .0008517    .0012178
------------------------------------------------------------------------------
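Again as a sketch of the arithmetic for one cell of the table below (verify it yourself): with 1940 as the omitted category, the predicted proportion for a wife without college education in 1960 is _cons plus the 1960 coefficient, i.e. .0010347 + (-.0004723) ≈ .00056, or about 0.06 percent; a wife with at least some college adds the fSomeColplus coefficient on top of that.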

 

Predicted percentage married to black men:

Year        Wives with at least some college        Wives without college education
1940
1960
1970
1980
1990
2000

 

 

c) Take the predicted values from parts a and b above, and graph them in Excel. Put the college-educated white women and the non-college-educated white women together on the same graph, but plot the predicted values from part (a) and part (b) separately. In Excel, use XY scatter plots, with points connected by lines. Examine the two graphs. Comment on linearity (are the predicted values linear with respect to year?) and additivity (is the difference between college-educated women and non-college-educated women constant across years?).

 

 

Question 6 is an Extra Credit Question for Soc 180B/280B, but is required in Soc 381:

6) A brief logistic regression exercise, using our friendly old March 2000 CPS. Create a new dummy variable married, equal to 1 when the subject is married (marst==1| marst==2), and equal to zero otherwise. This married variable will be the dependent variable in our regressions.
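A sketch of the setup, including the age-squared term used in part (a) below (the names married and age_sq match the syntax examples that follow):

* dummy equal to 1 when marst is 1 or 2 (married), 0 otherwise
generate married = (marst==1 | marst==2)
* quadratic age term for the models below
generate age_sq = age*age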

a) Run a logistic regression predicting whether the subject is married, for subjects over age 16, using age, age_squared, and race as the predictor variables. The syntax will be as follows:

desmat: logit married @age @age_sq race if age>16

or

xi: logit married age age_sq i.race if age>16

or

logit married age age_sq i.race if age>16

 

Generate predicted values and summarize them (a sketch follows below). Don’t forget that the “or” option will give you the exponentiated, or odds-ratio, version of the results:

 

logit married age age_sq i.race if age>16, or
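A sketch of getting and summarizing the predicted probabilities after the logit fit (phat is a hypothetical variable name):

* predicted probability of being married for each in-sample observation
predict phat if e(sample), pr
summarize phat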

 

b) Interpret the black coefficient (assuming white is the excluded category for race) in the above logistic regression. What is the 95% confidence interval for the black coefficient in odds ratio terms, and in log odds ratio terms? Does the 95% confidence interval for the odds ratio of the black coefficient include 1 (explain)?

 

c) Explain the 5 df Likelihood Ratio Test that Stata produces in the model for question 6(a). What two models are being compared, and what is the conclusion of this Likelihood Ratio Test (in other words, what null hypothesis is being rejected)?

 

d) Start with the logistic regression model from 6(a); let’s call that Model 1. For Model 2, add one term, a dummy variable for whether the respondent was born in the US. For Model 3, add dummy variables for each category of metropolitan status (using the variable metro). Comment on the Likelihood Ratio Test comparisons between Models 1, 2, and 3, indicating the degrees-of-freedom difference between the models, the expected chi-square values for each comparison, what the null hypothesis is, and whether the null hypothesis is rejected. Which model fits best by the Likelihood Ratio Test? Which model fits best by BIC? Generate P-values for the LRT and the BIC comparisons between models.
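A sketch of the nested-model comparisons, assuming the born-in-the-US dummy is named US_born (store each model so lrtest and the information criteria can compare them):

logit married age age_sq i.race if age>16
estimates store m1
logit married age age_sq i.race US_born if age>16
estimates store m2
logit married age age_sq i.race US_born i.metro if age>16
estimates store m3
* likelihood ratio tests of the nested models (estimation samples must be identical)
lrtest m1 m2
lrtest m2 m3
* AIC and BIC for all three models
estimates stats m1 m2 m3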

 

e) [Skip this part for Soc 381 in 2021] Run the same regression as in part 6(a) above (i.e. Model 1), but with regular OLS regression (i.e. the familiar Stata command regress) instead of logit regression. Generate the predicted values and summarize them. How are the predicted values different between the OLS and the logistic regressions? Comment on the difference in the range of the predicted values between the OLS and the logistic regression results. Graph the black and white actual and predicted marriage rates by respondent age, for the OLS and for the logistic models predicting marriage rate.
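A sketch of the OLS counterpart and its predicted values (phat_ols is a hypothetical name):

* linear probability model with the same predictors as Model 1
regress married age age_sq i.race if age>16
predict phat_ols if e(sample)
* compare the range of these predictions to the logit predictions (OLS predictions can fall outside 0-1)
summarize phat_ols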