HW 3, Soc 388
Due Wednesday, Oct 24, in class
Late homeworks will generally not be accepted, because I will post answers to my website soon after the homework is due. If you're stuck, email me or the TA. If you still can't figure it out, just do the best you can and don't panic.
NOTE: All homeworks should include an edited STATA log.
New Reading Assignment: Agresti, Ch 3
Once again (since it isn't defined in either text): BIC = LRT - df*ln(N), where LRT is the goodness of fit chi-square, df is the residual degrees of freedom, and N is the sample size from the whole dataset. The syllabus contains references that define BIC (Raftery 1986) and critique it (Weakliem 1999).
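As a quick check on the BIC column, the definition above can be applied directly to values from the table. The following is a minimal Python sketch, not part of the Stata log; the function name `bic` and the choice of models 1, 6, and 9c are only for illustration.

```python
import math

def bic(lrt, resid_df, n):
    """BIC = LRT - df * ln(N), using the definition given in this handout."""
    return lrt - resid_df * math.log(n)

N = 649_821  # number of couples in the whole dataset

# (model label, goodness of fit chi-square, residual df), copied from the table
models = [("1", 4_503_895, 224), ("6", 5_070, 163), ("9c", 107.3, 69)]

bic_by_model = {label: bic(lrt, df, N) for label, lrt, df in models}
for label, value in bic_by_model.items():
    print(f"Model {label}: BIC = {value:.1f}")
```

The values agree with the BIC column for those three models after rounding.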
Important ideas: Goodness of fit measures, hypothesis testing, inference across many dimensions, different kinds of controls.
The data are available from my website, as well as from my public folder via ftp (/afs/ir/users/m/r/mrosenfe/public), under the name "70-80-90 MR intermar.dta" (Stata ver 6) or "70-80-90 MR intermar.xls" if you'd rather start with the Excel file and copy it into Stata.
The data have 225 cells and 6 variables. There are 649,821 couples in the dataset (it's intermarriage data, surprise surprise). The data consist of married people age 20-29 at the time of the census. The variables are meth (husband's ethnicity) and feth (wife's ethnicity), with the same 5 categories we have seen before (non-Hispanic Black, non-Hispanic White, Mexican, Other Hispanic, non-Hispanic Other). There is a variable for census year (70, 80, and 90), and there is a
variable for nativity of each spouse (born in the
In the following table, BW is the gender-symmetric Black-White interaction; MOh is the gender-symmetric Mexican-Other Hispanic interaction; ethintdm is the dummy variable that treats all 5 kinds of ethnic endogamy the same; and ethintct is the categorical variable that treats each of the 5 kinds of ethnic endogamy differently.
Fill in the following Table
| Model # | Model Description | Terms in model | Residual df | Goodness of fit Chi-square | Goodness of fit Chi-square P | BIC | ID |
|---|---|---|---|---|---|---|---|
| 1 | Constant only | 1 | 224 | 4,503,895 | 0 | 4500897 | 86.2 |
| 2 | year*meth year*feth | 27 | 198 | 1,579,790 | 0 | 1577140 | 66.4 |
| 3 | year*meth*mgen year*feth*fgen | 57 | 168 | 453,658 | 0 | 451409.4 | 19.7 |
| 4 | year*meth*mgen year*feth*fgen BW, MOh | 59 | 166 | 200,027 | 0 | 197805.2 | 8.5 |
| 5 | year*meth*mgen year*feth*fgen ethintdm | 58 | 167 | 26,839 | 0 | 24603.8 | 2.94 |
| 6 | year*meth*mgen year*feth*fgen ethintct | 62 | 163 | 5,070 | 0 | 2888.3 | 1.02 |
| 7a | year*meth*mgen year*feth*fgen ethintct*@year | 67 | 158 | 4,069 | 0 | 1954.3 | 0.789 |
| 7b | year*meth*mgen year*feth*fgen ethintct*year | 72 | 153 | 3,882 | 0 | 1834.2 | 0.744 |
| 8 | year*meth*mgen year*feth*fgen ethintct*year BW MOh | 74 | 151 | 3,203 | 0 | 1181.9 | 0.687 |
| | My better fitting models: | | | | | | |
| 9a | year*meth*mgen year*feth*fgen ethintct*year BW MOh meth*fgen feth*mgen | 82 | 143 | 2,053 | 0 | 139 | 0.466 |
| 9b | year*meth*mgen*fgen year*feth*fgen*mgen ethintct*year*fgen*mgen BW MOh | 128 | 97 | 536.5 | 0 | -761.8 | 0.133 |
| 9c | year*meth*mgen*fgen year*feth*fgen*mgen ethintct*year*fgen*mgen QS*year, QS*mgen*fgen, BohS*fgen, BWS*year | 156 | 69 | 107.3 | 0.0022 | -816.2 | 0.0514 |

Note on model 9c: QS here is the full set of 5 off-diagonal, symmetric ethnic interactions (including BW and MOh, see log), BohS is the sex-specific interaction between Black men and Other Hispanic women, and BWS is the sex-specific interaction between Black men and White women.
1) Fill in the above table, models 1-8
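2) Does racial endogamy vary significantly between groups? What is the statistical test that answers that question?
Yes; Model 6 improves dramatically on the fit of Model 5. The test is the likelihood ratio test between these two nested models: the goodness of fit chi-square drops from 26,839 to 5,070, a difference of 21,769 on 4 degrees of freedom.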
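The comparison of Models 5 and 6 is a nested-model likelihood ratio test: subtract the chi-squares, subtract the residual df, and refer the difference to a chi-square distribution. Here is a minimal Python sketch of that arithmetic, assuming only the standard library; `chi2_sf` and `lrt_nested` are illustrative helper names, not Stata commands.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: start from the df = 1 case (erfc), then add one term per extra 2 df
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) * math.exp(-half) / math.gamma(1.5)
    a = 1.5
    for _ in range((df - 1) // 2):
        total += term
        term *= half / a
        a += 1.0
    return total

def lrt_nested(restricted, fuller):
    """Each argument is (goodness of fit chi-square, residual df)."""
    diff_chi2 = restricted[0] - fuller[0]
    diff_df = restricted[1] - fuller[1]
    return diff_chi2, diff_df, chi2_sf(diff_chi2, diff_df)

# Model 5 (ethintdm) vs Model 6 (ethintct), values copied from the table
diff_chi2, diff_df, p_value = lrt_nested((26_839, 167), (5_070, 163))
print(diff_chi2, diff_df, p_value)
```

The same function applies to any pair of nested models in the table, for example Model 6 against Model 7b, or Model 7a against Model 7b.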
3) Does racial endogamy vary significantly over time? More so for some groups than for others?
Yes; Model 7b improves quite a lot on Model 6 (an improvement of more than 1100 in chi-square on 10 degrees of freedom). Black endogamy declines the most (but was the largest to start with), from a log odds ratio of 7.73 in 1970 to 7.73-1.39=6.34 in 1990. Between 1970 and 1980, 'Oth-NH', a category that includes mostly Asians and Native Americans, declines the most in log odds ratio terms, from 3.186 to 3.186-1.029=2.157. You could chart the racial endogamy of all 5 groups over time, 70-80-90, and show how all kinds of racial endogamy (here measured jointly with a bunch of controls) decline sharply over time.
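The log odds ratio arithmetic quoted in the answer above can be reproduced in a few lines. This is a small Python sketch using only the four coefficients quoted there; the variable names are made up for illustration. Exponentiating a change in the log odds ratio gives the multiplicative change in the odds ratio itself.

```python
import math

# Log odds ratios and changes quoted in the answer above
black_1970 = 7.73
black_change_to_1990 = -1.39
black_1990 = black_1970 + black_change_to_1990

oth_nh_1970 = 3.186
oth_nh_change_to_1980 = -1.029
oth_nh_1980 = oth_nh_1970 + oth_nh_change_to_1980

# exp(change) is the factor by which the odds ratio is multiplied
black_factor = math.exp(black_change_to_1990)
oth_nh_factor = math.exp(oth_nh_change_to_1980)

print(f"Black: {black_1970} -> {black_1990:.2f}, odds ratio x {black_factor:.2f}")
print(f"Oth-NH: {oth_nh_1970} -> {oth_nh_1980:.3f}, odds ratio x {oth_nh_factor:.2f}")
```

On these figures, the Black endogamy odds ratio in 1990 is roughly a quarter of its 1970 value.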
4) Does
The comparison of models 9a and 9b
demonstrates a very significant effect of U.S. nativity on racial endogamy, but
models 9a-c have a lot of terms in them and that makes interpretation messy. A
simple approach would be to take model 7b, and add ethintct*mgen and
ethintct*fgen, and look at the interaction terms. In fact what one sees is that
5) Based on models 1-8, which would you say is a more powerful force in the marriage market: racial endogamy or the division between Blacks and Whites? Why?
Racial endogamy is stronger than the Black-White divide. In Models 4 and 8, the racial endogamy terms are generally much larger in absolute value (representing stronger changes in the log odds ratio of marriage) than the Black-White interaction. Furthermore, the racial or ethnic endogamy terms contribute more to the goodness of fit (compare Model 6 to Model 4) than the Black-White term (which is important in its own right, but not quite as important).
6) Which of the models 1-8 fits the best by LRT and by BIC? Do any of them fit reasonably well?
Model 8 is the best fitting by LRT and BIC, but it's not nearly good enough: its chi-square is still 3,203 on 151 df, and its BIC of 1181.9 is still positive.
7) What is the difference between treating year as a continuous vs categorical variable in interactions with ethnic endogamy? How do models 7a and 7b differ? How do you interpret this difference?
Since year takes on 3 values in the dataset, ethintct*year adds 5x2=10 terms to the model compared to the base values of ethintct in model 6, i.e. the change from 1970 to 1980 and the change from 1970 to 1990 for each of the 5 ethintct terms. If we treat year as a continuous variable, ethintct*@year adds 5 terms to the model, because the change over time for each ethintct term is assumed to be linear with time. So there is a 5 df difference between 7a and 7b, depending on whether we assume that ethnic endogamy changes in a linear way over time, or whether we assume the decline in ethnic endogamy over time is non-linear enough to account for each year separately. Since model 7b fits substantially better than model 7a (a difference in -2LL of almost 200 on 5 df, and model 7b has lower BIC), this tells us that the decline in ethnic endogamy over time is not quite linear.
8) Construct a model that fits better (by BIC or LRT) than any of the models 1-8. What have you added to the previous models?
Models 9a-9c fit substantially better than models 1-8. Models 9b and 9c fit well by the BIC, and model 9c approaches a good fit by the LRT, which is not easy to obtain in a large dataset like this. Models 9b and 9c push the data to its limits, so the 'difficult' option speeds up the likelihood maximization considerably. We only start to make real progress in fitting the data when we add the 4-way interactions of meth*mgen*fgen*year and feth*mgen*fgen*year and the partial 5-way ethintct*mgen*fgen*year (partial 5-way because ethintct accounts for part of the saturated interaction meth*feth).
9) Now here are some more abstract questions about a hypothetical dataset with 3 variables: A (5 categories), B (4 categories), and C (3 categories). Total number of cells is 5*4*3=60. Fill in the following table.
| Model # | Model Description | Terms in model | Residual df |
|---|---|---|---|
| 1 | A (constant plus 4 terms to fit the 5 categories of A; there's always one excluded category) | 5 | 55 |
| 2 | A,B (add in the 3 terms to fit the 4 categories of B) | 8 | 52 |
| 3 | A*B (Saturated interaction takes the full 5*4 terms to fit the 20 cells of A*B) | 20 | 40 |
| 4 | A*B,C (Adds to model 3 the 2 terms to fit the 3 categories of C) | 22 | 38 |
| 5 | A*B, B*C, A*C (There are a couple of ways of thinking about this one. This model has the constant, plus the direct effects of A, B, and C (4+3+2), plus the saturated interactions A*B (4*3 terms), B*C (3*2 terms), and A*C (4*2 terms), for a total of 1+9+12+6+8=36 terms.) | 36 | 24 |
| 6 | A*B*C (Saturated model has one term for every cell in the dataset) | 60 | 0 |
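The term counts in the table for question 9 follow one rule: each effect contributes the product of (categories - 1) over its variables, and a starred term such as A*B brings all of its lower-order effects with it. Below is a minimal Python sketch of that counting rule; the function `count_terms` and the tuple notation for model terms are assumptions made for illustration, not Stata syntax.

```python
from itertools import combinations
from math import prod

def count_terms(levels, generators):
    """Count the terms in a hierarchical loglinear model.

    levels: dict mapping variable name -> number of categories
    generators: highest-order terms, e.g. [("A", "B"), ("C",)] for A*B, C
    """
    effects = {()}  # the empty tuple stands for the constant
    for gen in generators:
        for r in range(1, len(gen) + 1):
            for subset in combinations(sorted(gen), r):
                effects.add(subset)  # A*B implies A, B, and the A-by-B interaction
    # Each effect uses (categories - 1) terms per variable, multiplied together
    return sum(prod(levels[v] - 1 for v in eff) for eff in effects)

levels = {"A": 5, "B": 4, "C": 3}
cells = prod(levels.values())

models = {
    "1": [("A",)],
    "2": [("A",), ("B",)],
    "3": [("A", "B")],
    "4": [("A", "B"), ("C",)],
    "5": [("A", "B"), ("B", "C"), ("A", "C")],
    "6": [("A", "B", "C")],
}
terms = {name: count_terms(levels, gens) for name, gens in models.items()}
for name, n_terms in terms.items():
    print(f"Model {name}: {n_terms} terms, residual df = {cells - n_terms}")
```

Residual df is always the number of cells minus the number of terms, which is why the saturated model A*B*C leaves 0 df.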