Education 161                                            Winter 2000

Assignment 3
Due Feb 22, 2000

Note: data files are available in one of two locations:
   path: /usr/class/ed161/[data file]
or using web services at the URL
   http://www.stanford.edu/class/ed161/hw/[data file]

1. For the NELS data (see the file description in the Course Examples
   Index), obtain the correlation between 10th grade science
   achievement scores and 8th grade science scores. Does this
   correlation change when it is computed for males and females
   separately?
   [Note: you will likely want to use the Minitab copy command along
   with the "use" subcommand.]
--------------------------------------------------------------------
2. In the file 'hw3p2.dat' are X (C1) and Y (C2). Estimate E(Y|X) and
   give interval estimates for the intercept and slope parameters.
   Examine the effects of anomalous and/or influential observations
   on the fits and the parameter estimates.
-----------------------------------------------------------
3. The file 'prognosis.dat' contains data on days hospitalized
   (X in C1) and a prognosis index (Y in C2) for 15 severely injured
   patients. A hospital administrator wants to develop a prediction
   equation for the long-term prognosis using the length of the
   hospital stay.

   (a) Develop a prediction equation by straightening the scatterplot
       and using a straight-line fit. Give the fit and an interval
       estimate for a patient hospitalized 10 days. Repeat for
       60 days hospitalization.

   (b) For this same problem, develop a prediction equation for the
       long-term prognosis by fitting a polynomial. Compare the fits
       and the interval estimates for expected prognosis for a
       patient hospitalized 10 days from the two approaches: the
       polynomial fit vs. straightening the scatterplot and using a
       straight-line fit as in part (a). Repeat the comparison for
       60 days hospitalization.
--------------------------------------------
4.
   Bodyfat data revisited

   By referring to the Course Example file bodyfat.out, or by redoing
   the analyses, use this example to once again illustrate the
   vagaries of multiple regression coefficients (and improper
   attempts to interpret them).

   Which of the three predictors (triceps X1, thigh X2, or midarm X3)
   is the best single predictor of bodyfat? What is the regression
   coefficient for that predictor in a single-predictor equation?
   What is the corresponding t-statistic for that coefficient?

   Now consider the regression using both triceps and thigh as
   predictors. Compare the coefficients (and their t-statistics) from
   this multiple regression with those from the corresponding
   single-predictor equations.

   Now consider the multiple regression using all three predictors.
   For triceps and thigh, compare the coefficients (and their
   t-statistics) from this multiple regression with the results from
   the previous regression equations. To decrease bodyfat, does one
   puff up one's thighs?
------------------------------------------------------------
5. Patient Satisfaction Data. The data reside in the file patient.dat.

   A hospital administrator wished to study the relation between
   patient satisfaction Y (in C1) and X1, the patient's age (in C2);
   X2, an index of severity of illness (in C3); and X3, anxiety level
   (in C4). Larger values of Y, X2, and X3 indicate more
   satisfaction, more severe illness, and more anxiety, respectively.

   a. Prepare a stem-and-leaf plot for each of the predictor
      variables. Are any noteworthy features revealed by these plots?

   b. Fit the multiple regression model (flat plane) for the three
      predictor variables to the data and state the estimated
      regression function. How is the coefficient for X2 interpreted
      here?

   c. Obtain the residuals and prepare a box plot of the residuals.
      Do there appear to be any outliers?

   d. Using the three-predictor regression model from part b, test
      whether there is a regression relation; use a Type I error rate
      of .10. State the alternatives, decision rule, and conclusion.

   e.
      For the fit in part b, verify that the regression coefficients
      can be obtained from straight-line fits to the corresponding
      partial regression plots. Use the coefficient for X2 as your
      example.
---------------------------------------------------
6. Consider a one-way classification with four levels (I = 4). We are
   given the population cell means (mu(1) through mu(4)) as:
   7, 9, 6, 15.

   Consider the general linear model setup (with 3 group membership
   indicators)

      E(Y|G1,G2,G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3

   where
      G1 = 1 if treatment 2,   G1 = 0 otherwise
      G2 = 1 if treatment 3,   G2 = 0 otherwise
      G3 = 1 if treatment 4,   G3 = 0 otherwise

   a. Determine the values of the 4 betas in the regression model.

   b. Express mu(3) - mu(2) in terms of the betas. Check by numerical
      substitution.
---------------------------------------------------------------
7. The file salary.dat contains data from a salary survey discussed
   in lecture: C1 is experience, C2 is education level (1 for HS,
   2 for BS, 3 for advanced degree), C3 indicates a management
   position (= 1) or not (= 0), and C4 is the outcome measure,
   salary.

   First, code the 3 levels of education using 2 group membership
   indicators (so that education is not used as an interval scale).
   In the solutions we use HS as the base (the 0 0 code).

   What is the single best predictor of salary?

   Predict salary using experience, education, and management. Add to
   the model two management-education interaction terms. Do these
   terms add significantly to the prediction?

   Give an interval estimate of the value of an additional year of
   experience. Repeat for an advanced degree in addition to the BS.
   (That is, the comparison asked for here is between advanced degree
   and HS; it is *not* meant to indicate that I want a differential
   between advanced degree and BS. That is a harder thing to do in
   this coding, although it can be done.)
--------------------------------------------------------------------
8.
   (former quiz question)

   A study of several hundred professors' salaries in a large
   American university in 1969 (AER, 1973, p.469) yielded the
   following prediction equation:

      S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X

   where S is annual salary, B is number of books written, A is
   number of ordinary articles, E is number of excellent articles,
   D is number of Ph.D.'s supervised, Y is years of experience,
   T = 1 if student evaluations are above the median, 0 otherwise,
   and X = 1 if female, 0 otherwise.

   For a prof with B=A=E=D=X=1 and Y=5, what's the expected change in
   salary if she goes from very good to poor student evaluations?

   Mean salaries were $16,100 for males and $11,200 for females.
   What is the value of the slope from a simple S on X regression?
-------------------------------------------------------------------
9. A researcher is studying the effect of an incentive on the
   retention of subject matter and is also interested in the role of
   time devoted to study. Subjects are randomly assigned to two
   groups, one receiving (C3 = 1) and the other not receiving
   (C3 = 0) an incentive. Within these groups, subjects are randomly
   assigned to 5, 10, 15, or 20 minutes of study (C2) of a passage
   specifically prepared for the experiment. At the end of the study
   period, a test of retention (C1) is administered. We treat the
   study time as a covariate for investigating the differential
   effects of the incentive.

   Part I: ANCOVA

   Use the Minitab output below to answer the following questions.
   (This is a quiz question from a prior year.)
   (For reference, the raw data are in the file retention.dat.)

   What is the slope of the C1 on C2 regression line for the
   12 subjects in the incentive group?

   What is the correlation between C1 and C2 for the incentive group?

   Construct a 99% confidence interval for the analysis of covariance
   treatment effect.

MTB > ancova c1 = c3;
SUBC> covariates c2;
SUBC> means c3.

Analysis of Covariance for C1

Source        DF     ADJ SS         MS
Covariates     1     42.008     42.008
C3             1    100.042    100.042
Error         21     30.575      1.456
Total         23    172.625

Covariate      Coeff      Stdev    t-value
C2            0.2367     0.0441      5.371

ADJUSTED MEANS

C3      N         C1
 0     12     5.8333
 1     12     9.9167

MTB > describe c1-c2;
SUBC> by c3.

        C3      N      MEAN    MEDIAN     STDEV
C1       0     12     5.833     5.500     1.850
         1     12     9.917    10.000     1.782
C2       0     12     12.50     12.50      5.84
         1     12     12.50     12.50      5.84

MTB > let c4 = c2*c3
MTB > regress c1 3 c3 c2 c4

The regression equation is
C1 = 2.50 + 4.83 C3 + 0.267 C2 - 0.0600 C4

Predictor        Coef       Stdev
Constant       2.5000      0.8646
C3              4.833       1.223
C2            0.26667     0.06314
C4           -0.06000     0.08929

MTB > regress c1 2 c3 c2

The regression equation is
C1 = 2.87 + ???? C3 + ????? C2

Predictor        Coef       Stdev
Constant       2.8750      0.6517
C3             ??????      0.4926
C2            ???????     0.04406
----------------------------------------
END HW3