Education 161 Winter 2000 Assignment 3 Solutions Feb 22, 2000 Problem 1. NELS data elaine47:~> minitab MTB > read '[path]/nels.dat' c1-c18 4804 ROWS READ ROW C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 1 124915 26.79 19.86 37.42 0.11 -0.05 1 0 1 0 0 2 124916 21.63 20.63 35.61 -0.52 -0.30 1 0 1 0 0 3 124922 20.34 19.81 26.13 0.78 -0.86 0 1 1 0 0 4 124932 21.83 14.71 22.15 -0.16 -0.72 0 1 1 0 0 . . . ROW C12 C13 C14 C15 C16 C17 C18 1 0 1 0 0 0 0 0 2 0 0 0 0 0 1 0 3 1 0 0 0 0 0 0 4 1 0 0 0 0 0 0 . . . MTB > note let's look at gender first MTB > tally c13; SUBC> all. C13 COUNT CUMCNT PERCENT CUMPCT 0 2503 2503 52.10 52.10 1 2301 4804 47.90 100.00 N= 4804 MTB > note So we have 52% females in sample MTB > note lets look at science ach measures MTB > describe c1 c2 N MEAN MEDIAN TRMEAN STDEV SEMEAN C1 4804 4757590 4713049 4793801 2661069 38393 C2 4804 22.296 22.185 22.270 6.059 0.087 MIN MAX Q1 Q3 C1 124915 9196999 2510105 7236356 C2 10.060 34.680 17.392 27.090 MTB > describe c2 c3 N MEAN MEDIAN TRMEAN STDEV SEMEAN C2 4804 22.296 22.185 22.270 6.059 0.087 C3 4804 19.463 19.050 19.351 4.857 0.070 MIN MAX Q1 Q3 C2 10.060 34.680 17.392 27.090 C3 9.460 32.880 15.650 22.967 MTB > boxplot c2 -------------------- ---------------I + I--------------- -------------------- ----+---------+---------+---------+---------+---------+--C2 10.0 15.0 20.0 25.0 30.0 35.0 MTB > boxplot c3 ---------------- ------------I + I-------------------- ---------------- ------+---------+---------+---------+---------+---------+C3 10.0 15.0 20.0 25.0 30.0 35.0 MTB > note rather nice and symmetric MTB > plot c2 c3 - *2 2 **426 **2432 3 - * * 2**43*566696++8+5++77443 3 32.0+ * *2 ****33+9897+++++++++78+8247* 4 - ** 2*423337875++++++++++++9+479225 C2 - * * * 34347464+4++++++++++98++54342 2 - * * 2**3244433+8+7+++++++++++++8+674*33 * - **343242284+++++++++++++++++9+3853 2 2 *2 * 24.0+ **32 364637++++++++++++++++++9645* 22** * - *3*433445++++++++++++++++9++++78**2 * - 2 44478+++++++++++++++++979372* ** * - *34*26+9++++++++++++++9+5752432 *** * - 352+6++++++++++++++96+9+3623**2 2 * * 16.0+ *6+8+++++++++++++++7693*2*2 *2 * * - **69++++++++++++95+5252*5 *2* * * * - *24+8++++++++682+75*44 **2**2* * - 297+6+++8+965354644322** *** * * * - **** * 2 * * * ------+---------+---------+---------+---------+---------+C3 10.0 15.0 20.0 25.0 30.0 35.0 MTB > corr c2 c3 Correlation of C2 and C3 = 0.742 MTB > note you do see some floor and ceiling effects in the plot. MTB > note numbers in the plot indicate overstriking data points MTB > note + indicates >9 points MTB > note now create male data MTB > copy c2-c3 c4-c5; SUBC> use c13 =1. MTB > describe c4-c5 N MEAN MEDIAN TRMEAN STDEV SEMEAN C4 2301 23.360 23.920 23.439 6.281 0.131 C5 2301 20.106 20.020 20.034 5.127 0.107 MIN MAX Q1 Q3 C4 10.060 34.680 18.435 28.330 C5 9.490 32.810 16.045 23.810 MTB > note now create female data MTB > copy c2-c3 c6-c7; SUBC> use c13 = 0. MTB > describe c6-c7 N MEAN MEDIAN TRMEAN STDEV SEMEAN C6 2503 21.317 20.790 21.215 5.675 0.113 C7 2503 18.872 18.280 18.746 4.516 0.090 MIN MAX Q1 Q3 C6 10.260 34.680 16.820 25.690 C7 9.460 32.880 15.350 22.040 It's always a good idea to check/double-check any operation on data: describe within gender provides an easy check here: MTB > describe c2-c3; SUBC> by c13. C13 N MEAN MEDIAN TRMEAN STDEV SEMEAN C2 0 2503 21.317 20.790 21.215 5.675 0.113 1 2301 23.360 23.920 23.439 6.281 0.131 C3 0 2503 18.872 18.280 18.746 4.516 0.090 1 2301 20.106 20.020 20.034 5.127 0.107 C13 MIN MAX Q1 Q3 C2 0 10.260 34.680 16.820 25.690 1 10.060 34.680 18.435 28.330 C3 0 9.460 32.880 15.350 22.040 1 9.490 32.810 16.045 23.810 MTB > corr c4 c5 Correlation of C4 and C5 = 0.744 MTB > plot c4 c5 - 2 ** *5 ***322 3 - * ***42*353574783938+46443 2 32.0+ ** * 2376455+875+8+755797244* 2 - * 2* 2*5735++77+79+779878*45222 C4 - * * *32234337 897++++5+93448323** * - * * *3* 4**28694++++8++5++76+43452 23 - **223** *5*3877+89++89+6++6+44 33* * 2 *2 * 24.0+ **22 2625*347+885+++8+++978986422* *2 * - *2*422*2394+849+695+8+7+44866523* * - 2 34*52*773+8+557+736565232*3 * - **2**334799++9++6+999263322*2 2 - 24*652559+7+88+76753 326*2 * * * * * 16.0+ 443766659+8+2296554433 ** * - *246+67+787774*4*7* 2* 4 ** - *2235+9756+6323*644 *2 **2* 2* * - *5642258464 2232432222* *** * * - * ** * 2 * * ------+---------+---------+---------+---------+---------+C5 10.0 15.0 20.0 25.0 30.0 35.0 MTB > corr c6 c7 Correlation of C6 and C7 = 0.729 MTB > plot c6 c7 - * 2 4** *** - * * * 2*3*22335222*3* * 32.0+ * * * *** 634423355278692*2* 3 2 - * 2*2*3*22*4 44935486374622334 3 C6 - * 2**24 3*546766+++7+664+222 3* * - 2* *4 32*2263757++++89748448222** * - *2 *32*3377798997++++49+66563522 * 24.0+ * * 2*24747+9++++++++8+6943223 * * * - * **322665++9+++++++9++5+44555 *2 - * 3269869+9++++++++5+5747242* * * - 22 *385+++++8++++++6+3+253*23 *** * - ***5*+77+++++++++++66+762422*** * 16.0+ *26587++++++9++++75326 ** 2 ** * * - * 4598++++++++++547423*** ** * * * - 293589+9++945*53**32 * - *4*6496446563*222*2* * * - * * ------+---------+---------+---------+---------+---------+C7 10.0 15.0 20.0 25.0 30.0 35.0 So in summary to compare overall and within gender correlations: MTB > corr c2 c3 Correlation of C2 and C3 = 0.742 MTB > corr c4 c5 Correlation of C4 and C5 = 0.744 MTB > corr c6 c7 Correlation of C6 and C7 = 0.729 Remarkably close MTB > stop ----------------------------------------------------------------------- 2. MTB > [read data from fil2 hw3p2.dat....] 100 ROWS READ ROW C1 C2 1 133.590 69.346 2 91.888 41.231 3 153.121 65.063 4 88.221 53.736 . . . Let's take a look at the data: MTB > plot c2 c1 105+ - * C2 - * - * * * ** - ** * * * * * 70+ * * *3 - * ** *2 * 2 2* ** * * - * * * 2** **3* *** * - * 2* *** * - ** *** 3** 2 ** 35+ * 3 * * 2 *** * - * * ** ** - * * ** - * - * 0+ - - - * ------+---------+---------+---------+---------+---------+C1 35 70 105 140 175 210 Let's do a regression of the full data set. The command below puts the studentized residuals in c3 the raw residuals in c4 and the fit for the straight-line regression in c5. MTB > regress c2 1 c1 c3 c5; SUBC> residuals c4. The regression equation is C2 = 6.56 + 0.435 C1 Predictor Coef Stdev t-ratio Constant 6.557 5.206 1.26 C1 0.43532 0.04808 9.05 s = 14.79 R-sq = 45.5% R-sq(adj) = 45.0% **one item to note here. Minitab has the "habit" that I find annoying to report the squared multiple correlation in a percentage metric. It would be much better just to say the the R-squared is .455 instead of 45.5%. But they are not incorrect; it behooves the user not to report a value of 45.5 for a number that is bounded in [0, 1]. Take heed.******************************************************** Analysis of Variance SOURCE DF SS MS Regression 1 17919 17919 Error 98 21425 219 Total 99 39344 Unusual Observations Obs. C1 C2 Fit Stdev.Fit Residual St.Resid 10 29 53.69 19.10 3.90 34.59 2.43RX 36 83 82.95 42.48 1.80 40.47 2.76R 49 81 4.85 41.86 1.84 -37.01 -2.52R 58 111 86.70 55.01 1.52 31.70 2.16R 66 82 10.64 42.09 1.82 -31.45 -2.14R 71 194 59.76 90.99 4.58 -31.23 -2.22RX 77 33 -17.99 20.82 3.72 -38.81 -2.71RX R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. (note fit is proportional to X in straight-line regr) so here is plot of raw residuals vs X. MTB > plot c4 c1 - C4 - * - * - * 25+ * ** * - ** * * - * * * * * * * * - ** * 2* * - * ** * *22 * * *3 ** 0+ 3 * *** * **2* * 2 * * - * * 2*2 2 ** * ** - * 2 * * * - * ** ** ** * - * * *2 * -25+ * - * * - * - * - ------+---------+---------+---------+---------+---------+C1 35 70 105 140 175 210 for this regression using the full data set (n = 100) we obtain the following inferences from original regression separate 95% interval estimates for pop. slope and intercept have endpoints: 6.56 +/- 1.99*5.206 (-3.8, 16.92) intercept .435 +/- 1.99*.048 (.34, .53) slope Instead we could Bonferroni the two inferences to control overall confidence coeff or do a joint estimate (refer to sec 5.1) For the purpose of the exercise let's proceed to remove (i.e. set aside) the three cases with the large residuals and influence (cases 10 71 77) refer to MT handbook pp. 15-17 for use of DELETE command. redo regression with reduced data set MTB > delete 10 71 77 c1 c2 MTB > regress c2 1 c1 c3 c5; SUBC> residuals c4. The regression equation is C2 = 3.30 + 0.470 C1 Predictor Coef Stdev t-ratio Constant 3.295 5.351 0.62 C1 0.47008 0.04952 9.49 s = 13.63 R-sq = 48.7% R-sq(adj) = 48.1% For this data set n = 97 residual variance decreases slightly and R-square increases slightly yet standard error of coeff's increase slightly--that's due to decreasing the SSX in the deletion of cases with high influence. Analysis of Variance SOURCE DF SS MS Regression 1 16734 16734 Error 95 17642 186 Total 96 34376 Unusual Observations Obs. C1 C2 Fit Stdev.Fit Residual St.Resid 35 83 82.95 42.08 1.76 40.86 3.02R 48 81 4.85 41.42 1.80 -36.57 -2.71R 57 111 86.70 55.61 1.43 31.09 2.29R 65 82 10.64 41.67 1.78 -31.03 -2.30R 83 172 80.26 83.94 3.60 -3.69 -0.28 X R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. MTB > plot c4 c1 - C4 - * - - * 25+ * ** * - ** * - * * ** * * * * - ** * * * * * * - * ** * ** * 22 * * * * 0+ * * * * * * 2 **2 * * 2* * * - * 3 ** 2 * * * * ** * - * 2 * * - * * ** ** ** * * - * *2 -25+ * * - * - * - - ----+---------+---------+---------+---------+---------+--C1 50 75 100 125 150 175 Let's carry this further for illustration Further deletion of cases with large residuals; And repeat regression with sample of size 93. MTB > delete 35 48 57 65 c1 c2 MTB > regress c2 1 c1 c3 c5; SUBC> residuals c4. The regression equation is C2 = 4.49 + 0.458 C1 Predictor Coef Stdev t-ratio Constant 4.490 4.719 0.95 C1 0.45826 0.04339 10.56 s = 11.81 R-sq = 55.1% R-sq(adj) = 54.6% Analysis of Variance SOURCE DF SS MS Regression 1 15562 15562 Error 91 12695 140 Total 92 28257 Unusual Observations Obs. C1 C2 Fit Stdev.Fit Residual St.Resid 26 108 79.20 53.96 1.23 25.24 2.15R 28 87 18.33 44.46 1.45 -26.12 -2.23R 42 62 59.13 32.86 2.24 26.27 2.26R 74 158 100.52 76.93 2.61 23.59 2.05R 79 172 80.26 83.11 3.14 -2.86 -0.25 X 81 104 75.77 52.06 1.23 23.72 2.02R R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. MTB > plot c4 c1 - * C4 - ** * - - ** * * 15+ * * * * * - * * * * - ** * * * * - * * * ** * * - ** *** * ** * * *** * 0+ * * * *2 * * * - * * * * * * * * * * - 3 * 2 * ** - * * * * - * * * -15+ * * * * * - * * * - * *2 - * - * ----+---------+---------+---------+---------+---------+--C1 50 75 100 125 150 175 Herre's a summary table of the three fits beta(0)hat beta(1)hat R-sq s ------------- ------------- ---- --- 6.557 (5.206) 0.435 (0.048) .455 14.79 3.295 (5.351) 0.470 (0.050) .487 13.63 [3 cases removed] 4.490 (4.719) 0.458 (0.043) .551 11.81 [4 more cases removed] Removal of outliers slightly reduced the standard errors of the estimates. Really doesn't make that much difference here, even though the outliers appeared large. The major effect appears to be on the magnitude of the intercept, but the intercept has a big standard error, so that small changes aren't of much import. It is not necessarily recommended that all (or even any) cases with large residuals be set aside (i.e. deleted and analysis redone); the purpose of this exercise is to look at the effects of so doing and just to practice and illustrate doing the computations. ASIDE: OUTPUT from RLINE In class I mentioned robust/resistant methods for straight-line regression. The Minitab implementation of such is RLINE; other packages--e.g. Splus--have more extensive methods of this type RLINE gives results closest it seems to the OLS regression on the full n=100 data set. note:HG p. 187 discusses briefly the use and interpretation of residuals and residual plots. -------------------------------------------------------------------- ----------------------------------------------------------------------- 3. This problem is concerned with using hospital stay (X) to estimate prognosis (Y). 15 ROWS READ ROW C1 C2 1 2 54 2 5 50 3 7 45 4 10 37 . . . ****** Straight-line Model ****** MTB > name c1 'Days' MTB > name c2 'P-Index' MTB > plot c2 c1 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 MTB > name c22 'Resids' MTB > regress c2 1 c1 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 46.5 - 0.753 Days Predictor Coef Stdev t-ratio Constant 46.460 2.762 16.82 Days -0.75251 0.07502 -10.03 s = 5.891 R-sq = 88.6% R-sq(adj) = 87.7% Analysis of Variance SOURCE DF SS MS Regression 1 3492.1 3492.1 Error 13 451.2 34.7 Total 14 3943.3 MTB > plot c22 c1 - * Resids - * - * - 5.0+ - * * - * - - * 0.0+ - * - * - * - -5.0+ * * - - * * * - - +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 Straight-line Model Prediction SUBC> predict 10; SUBC> predict 60. Fit Stdev.Fit 95% C.I. 95% P.I. 38.94 2.18 ( 34.23, 43.64) ( 25.36, 52.51) 1.31 2.67 ( -4.46, 7.08) ( -12.67, 15.29) From the plot of Prognosis Index on Days and that of Residuals a straight-line model is not your best bet. (even though healthy R-square for the straight-line fit) TRANSFORMATION TO STRAIGHTEN From the plot of Prognosis Index vs Days it appears that a log transformation or a square-root transformation on x may "straighten" the relationship between the two. recall Rule of Bulge would indicate Xdown or Ydown ****** Log Transformation ****** MTB > let c4 = loge(c1) here for use in the predict subcommand we need to define our stated X-values (hospital stay) into the new metric MTB > let k1 = loge(10) MTB > let k2 = loge(60) MTB > name c4 'ln-Days' MTB > plot c2 c4 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ ----+---------+---------+---------+---------+---------+--ln-Days 0.70 1.40 2.10 2.80 3.50 4.20 MTB > regress c2 1 c4 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 72.3 - 16.0 ln-Days Predictor Coef Stdev t-ratio Constant 72.283 2.704 26.73 ln-Days -15.9662 0.8386 -19.04 s = 3.241 R-sq = 96.5% R-sq(adj) = 96.3% Analysis of Variance SOURCE DF SS MS Regression 1 3806.8 3806.8 Error 13 136.5 10.5 Total 14 3943.3 Unusual Observations Obs. ln-Days P-Index Fit Stdev.Fit Residual St.Resid 1 0.69 54.000 61.216 2.159 -7.216 -2.99RX R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. MTB > plot c22 c4 - * - 3.5+ * * - Resids - * * - * - * 0.0+ * * - * - * * - - * -3.5+ * - - - - -7.0+ * - ----+---------+---------+---------+---------+---------+--ln-Days 0.70 1.40 2.10 2.80 3.50 4.20 SUBC> predict k1; SUBC> predict k2. Fit Stdev.Fit 95% C.I. 95% P.I. 35.520 1.054 ( 33.243, 37.796) ( 28.156, 42.883) 6.912 1.202 ( 4.315, 9.509) ( -0.557, 14.381) The plot is closer to straight-line. But the residual plot still shows a pattern. Nontheless the model works better than straight-line fit to the original data. ****** Square-root transformation ****** just for illustration we'll also show you the same stuff for a square-root transformation of hospital stay ( a reasonable choice also) MTB > let c5 = sqrt(c1) MTB > let k3 = sqrt(10) MTB > let k4 = sqrt(60) MTB > name c5 'Days^.5' MTB > plot c2 c5 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ --------+---------+---------+---------+---------+--------Days^.5 1.5 3.0 4.5 6.0 7.5 MTB > regress c2 1 c5 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 63.2 - 7.75 Days^.5 Predictor Coef Stdev t-ratio Constant 63.225 2.271 27.84 Days^.5 -7.7481 0.4097 -18.91 s = 3.262 R-sq = 96.5% R-sq(adj) = 96.2% Analysis of Variance SOURCE DF SS MS Regression 1 3805.0 3805.0 Error 13 138.3 10.6 Total 14 3943.3 MTB > plot c22 c5 - * Resids - - * - * 3.0+ - * - * - * - * * 0.0+ * - - - * - * -3.0+ * - * - * * - - --------+---------+---------+---------+---------+--------Days^.5 1.5 3.0 4.5 6.0 7.5 SUBC> predict k3; SUBC> predict k4. Fit Stdev.Fit 95% C.I. 95% P.I. 38.723 1.171 ( 36.193, 41.254) ( 31.235, 46.212) 3.209 1.357 ( 0.276, 6.141) ( -4.425, 10.842) The residual plot still shows a curvilinear pattern. b) In this part we will redo this exercise using a quadratic regression function. That may? work better... From the plot of Prognosis Index on Days and that of Residuals a straight-line model is not a good bet. (even though healthy R-square for the straight-line fit) Try adding in a quadratic term in Days: ******** Quadratic Model ****** MTB > let c3 = c1*c1 MTB > name c3 'Days^2' MTB > regress c2 2 c1 c3 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 55.8 - 1.71 Days + 0.0148 Days^2 Predictor Coef Stdev t-ratio Constant 55.822 1.649 33.85 Days -1.7103 0.1248 -13.70 Days^2 0.014807 0.001868 7.93 s = 2.455 R-sq = 98.2% R-sq(adj) = 97.9% Analysis of Variance SOURCE DF SS MS Regression 2 3871.0 1935.5 Error 12 72.3 6.0 Total 14 3943.3 SOURCE DF SEQ SS Days 1 3492.1 Days^2 1 378.9 MTB > plot c22 c1 - Resids - * - - * 2.5+ * - - * - * * - * 0.0+ * - - * * * - * - -2.5+ * - * - * - - +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 To obtain predictions for 10 and 60 days use PREDICT subc in regress Note here you need to enter x and x-squared for predict to work. SUBC> predict 10 100; SUBC> predict 60 3600. Fit Stdev.Fit 95% C.I. 95% P.I. 40.200 0.921 ( 38.194, 42.207) ( 34.485, 45.915) 6.511 1.292 ( 3.695, 9.328) ( 0.465, 12.558) Examination of the residuals and the squared multiple correlation show that this model fits the data much better. Compare the predictions and interval estimates from the straight-line fit to the original data. The interval estimates for the quadratic model are about one half the width of those from the straight-line model. Moreover for an extreme X value (e.g. 60) the point estimate from the straight-line model is too small as the model does not account for the curvature in the relationship. Fitting the quadratic seems to work slightly better than the transformations. note: HG section 8.28 explains the transformation of nonlinear relationships into linear relationships. HG does not discuss the distinction between a CI for the mean of Y for a given value of X vs. a CI for the prediction for the response of a new observation. see Minitab help for PREDICT ----------------------------------------------------------------------------- 4. Bodyfat data revisited I redid the analyses for this, which are fully available from the text example or from bodyfat.out Just to take up more bytes, recreate the relevant parts here. MTB > read '[path]/bodyfat.dat' c1-c4 20 ROWS READ ROW C1 C2 C3 C4 1 19.5 43.1 29.1 11.9 2 24.7 49.8 28.2 22.8 3 30.7 51.9 37.0 18.7 4 29.8 54.3 31.1 20.1 . . . MTB > name c1 'triceps' c2 'thigh' c3 'midarm' c4 'bodyfat' To find the single best predictor of bodyfat, look at the correlations of each predictor separately with bodyfat: MTB > corr c1-c4 triceps thigh midarm thigh 0.924 midarm 0.458 0.085 bodyfat 0.843 0.878 0.142 The correlations show that thigh circumference is the best single predictor (i.e., the regression of bodyfat on thigh will have a higher R-squared than will the regression of bodyfat on either of the remaining predictors.) Triceps skinfold thickness is a close second. Regression of bodyfat on thigh: MTB > regress 'bodyfat' 1 'thigh' The regression equation is bodyfat = - 23.6 + 0.857 thigh Predictor Coef Stdev t-ratio p Constant -23.634 5.657 -4.18 0.001 thigh 0.8565 0.1100 7.79 0.000 s = 2.510 R-sq = 77.1% R-sq(adj) = 75.8% The regression coefficient for thigh in the single-predictor equation is 0.8565, with a standard error of 0.1100. The corresponding t-statistic is 7.79, (which the p-value of 0.000 shows is significant at any Type I error rate we might choose.) Here's the regression of bodyfat on triceps: MTB > regress 'bodyfat' 1 'triceps' The regression equation is bodyfat = - 1.50 + 0.857 triceps Predictor Coef Stdev t-ratio p Constant -1.496 3.319 -0.45 0.658 triceps 0.8572 0.1288 6.66 0.000 s = 2.820 R-sq = 71.1% R-sq(adj) = 69.5% Triceps also looks like a good predictor when it is used by itself. ------------------ Regression using both triceps and thigh as predictors: MTB > regress 'bodyfat' 2 'triceps' 'thigh' The regression equation is bodyfat = - 19.2 + 0.222 triceps + 0.659 thigh Predictor Coef Stdev t-ratio p Constant -19.174 8.361 -2.29 0.035 triceps 0.2224 0.3034 0.73 0.474 thigh 0.6594 0.2912 2.26 0.037 s = 2.543 R-sq = 77.8% R-sq(adj) = 75.2% Now the coefficient for thigh is 0.6594, which is fairly close to the value of 0.8565 from the one-predictor equation. The standard error has increased from 0.1100 to 0.2912 (variance increased by a factor of 7), resulting in a decrease in the t-statistic from 7.79 to 2.26. Thus the t-statistic from the 2-predictor equation is almost 3.5 times smaller than that from the 1-predictor equation. The p-value reveals that the coefficient for thigh would not be considered significantly different from 0 if we used a familywise Type I error rate of .05. The coefficient for triceps changes substantially when we move from the 1-predictor to the 2-predictor equation. It decreases from 0.8572 to 0.2224, and its standard error increases from 0.1288 to 0.3034 (variance increases by a factor of about 5.5). The t-statistic decreases from 6.66 to 0.73, so the new t-statistic is 9 times smaller than that from the single-predictor equation. Triceps becomes nonsignificant (p=0.474). Here's the regression using all three predictors. MTB > regress 'bodyfat' 3 'triceps' 'thigh' 'midarm' * NOTE * triceps is highly correlated with other predictor variables * NOTE * thigh is highly correlated with other predictor variables * NOTE * midarm is highly correlated with other predictor variables The regression equation is bodyfat = 117 + 4.33 triceps - 2.86 thigh - 2.19 midarm Predictor Coef Stdev t-ratio p Constant 117.08 99.78 1.17 0.258 triceps 4.334 3.016 1.44 0.170 thigh -2.857 2.582 -1.11 0.285 midarm -2.186 1.595 -1.37 0.190 s = 2.480 R-sq = 80.1% R-sq(adj) = 76.4% Now the coefficient for thigh circumference has become negative: -2.857, as compared with 0.8565 from the single-predictor equation and 0.6594 from the 2-predictor equation. Its standard error is much larger than before (2.582), resulting in a t-statistic that is quite small: -1.11, as compared with 7.79 and 2.26 from the previous two equations. So even though the coefficient is negative, we can't say that it's significantly different from zero (p=0.285). In other words, thigh appears to be a useless predictor once we include midarm and triceps, even though it was the single best predictor of bodyfat when used alone. The coefficient for triceps remained positive (4.334), and is substantially larger than the coefficients for triceps in the 1- and 2- predictor equations (0.8572 and 0.2224, respectively). Its standard error increased quite a bit as well (3.016, compared with 0.1288 and 0.3034 in the previous 2 equations). Its t-statistic is 1.44, which is larger than that for the 2-predictor equation (0.73) but still not significant (p=0.170) despite the increase in the size of the coefficient. ----------------------------------------------------------------- 5. Patient Satisfaction for reference : C1 = Y (patient satisfaction) C2 = X1 (patient's age) C3 = X2 (severity of illness) C4 = X3 (anxiety level) part a MTB > read '[path]/patient.dat' into c1-c4 23 ROWS READ ROW C1 C2 C3 C4 1 48 50 51 2.3 2 57 36 46 2.3 3 66 40 48 2.2 4 70 41 44 1.8 . . . MTB > stem c2-c4 Stem-and-leaf of C2 N = 23 Leaf Unit = 1.0 5 2 89999 5 3 7 3 33 8 3 4 10 3 66 11 3 8 (2) 4 01 10 4 233 7 4 45 5 4 5 4 9 4 5 0 3 5 23 1 5 5 Stem-and-leaf of C3 N = 23 Leaf Unit = 1.0 1 4 3 2 4 4 4 4 66 9 4 88899 (6) 5 000111 8 5 23 6 5 445 3 5 6 2 5 8 1 6 1 6 2 Stem-and-leaf of C4 N = 23 Leaf Unit = 0.010 2 18 00 3 19 0 4 20 0 6 21 00 10 22 0000 (5) 23 00000 8 24 0000 4 25 0 3 26 3 27 3 28 3 29 000 Looking at these stem-and-leaf plots we see that X2 (severity of illness) is approximately normally distributed, X1 (patient's age) is approximately uniformly distributed, and X3 (anxiety level) has several large extreme points at 29. We should remain aware of how these points are influencing the regression equation. -------------------- part b fit the multiple regression with all linear terms MTB > regress c1 3 c2 c3 c4 c5 c6; SUBC> residuals c7. The regression equation is C1 = 163 - 1.21 C2 - 0.666 C3 - 8.6 C4 Predictor Coef Stdev t-ratio p Constant 162.88 25.78 6.32 0.000 C2 -1.2103 0.3015 -4.01 0.001 C3 -0.6659 0.8210 -0.81 0.427 C4 -8.61 12.24 -0.70 0.490 s = 10.29 R-sq = 67.3% R-sq(adj) = 62.1% Analysis of Variance SOURCE DF SS MS F p Regression 3 4133.6 1377.9 13.01 0.000 Error 19 2011.6 105.9 Total 22 6145.2 SOURCE DF SEQ SS C2 1 3678.4 C3 1 402.8 C4 1 52.4 The estimate for beta(2) in this equation (i.e. -.666) is the sample coefficient for X2 when X1 X2 X3 are used to predict Y. It can be interpreted (through partial regression plots, adjusted variables) as the slope of the regression line produced by the regression of outcome variable (Y) on the variable X2 dot X1 X3, i.e. the adjusted variable X2 with the associations between X2 and X1 and X3 removed. In order to understand this statement more clearly, let's derive the partial regression coefficient for X2, using the procedure shown in class (and the marks examples). The partial regression slope which we are looking for in this case is c1 c3 dot c2 c4 (read Y on X2 dot X1 X3). First step in this procedure will be to regress the outcome variable in c1 upon the predictor variables in c2 and c4. MTB > regress c1 on 2 c2 c4; SUBC> residuals c8. The regression equation is C1 = 147 - 1.24 C2 - 15.9 C4 Predictor Coef Stdev t-ratio p Constant 147.08 16.73 8.79 0.000 C2 -1.2434 0.2961 -4.20 0.000 C4 -15.891 8.256 -1.92 0.069 s = 10.20 R-sq = 66.1% R-sq(adj) = 62.7% Our next step is to regress X2 onto X1 and X3. Residuals from this regression represent the part of in the X2 values not predictable by X1 and X3 values. MTB > regress c3 2 c2 c4; SUBC> residuals c9. The regression equation is C3 = 23.7 + 0.0496 C2 + 10.9 C4 Predictor Coef Stdev t-ratio p Constant 23.728 4.597 5.16 0.000 C2 0.04962 0.08135 0.61 0.549 C4 10.929 2.268 4.82 0.000 s = 2.802 R-sq = 63.8% R-sq(adj) = 60.2% In order to get the partial regression slope, c1 c3 dot c2 c4, we simply regress the residuals from the first regression onto the residuals from the second regression. A plot of c8 vs c9 is the scatterplot for this regression coefficient. MTB > regress c8 1 c9 The regression equation is C8 = 0.00 - 0.666 C9 Predictor Coef Stdev t-ratio p Constant 0.000 2.041 0.00 1.000 C9 -0.6659 0.7809 -0.85 0.403 s = 9.787 R-sq = 3.3% R-sq(adj) = 0.0% You can note that the coefficient of c9 in this equation is identical to the coefficient of c3 in the original full regression equation. The small R-sq for this regression (i.e. .03) suggests that in combination with X1 and X3, X2 is pretty useless in providing additional information about Y. note: HG sections 8.17-8.19 show the development of a multiple regression equation with 2 predictor variables, a procedure that can be extended to more predictors. The case study at the end of chapter 8 is another example. ------------------------------ part c produce a boxplot of the residuals from the full regression model. Here we use the simple residuals ei. Recall, that we placed these residuals in c7. MTB > boxp c7 ------------------------ ----------------I + I-------------- ------------------------ +---------+---------+---------+---------+---------+------C7 -18.0 -12.0 -6.0 0.0 6.0 12.0 MTB > stem c7 Stem-and-leaf of C7 N = 23 Leaf Unit = 1.0 1 -1 6 5 -1 4321 7 -0 76 10 -0 310 (6) 0 012234 7 0 577 4 1 2334 Both the boxpot and the stem-and-leaf plot suggest that there are no outliers. ------------------------------------------- part d. Examine whether or not there is a general regression relation between the three predictor variables (X1, X2 and X3) and the outcome variable (Y). Our null hypothesis in this case is that parameters beta(1), beta(2), and beta(3) all are equal to zero. The alternative hypothesis is that any one of these three coefficients is not equal to zero. The test statistic for this hypothesis test is calculated as follows: MSReg 1377.9 ----- = ------ = 13.01 MSE 105.9 As you may have noticed, minitab calculates this test statistic for you. It should be compared with critical value F (.90;3,19). You can get an exact value for this using the minitab invcdf command. MTB > invcdf .90; SUBC> f 3, 19. 0.9000 2.3970 Since 13.01 > 2.397, we can reject the null hypothesis of no regression relation. As was pointed out in class, this isn't telling us anything very exciting. This test doesn't tell us anything about the individual predictor variables (X1, X2, X3). note: Minitab Help (windows) explains the interpretation of the anova table in the output. See "interpreting result-simple linear regression." ******************************************* part e Since we already have the two needed sets of residuals for the partial regression plot for X2 as a predictor of Y, using X1 X2 X3, we can simply give the plot command. MTB > plot c8 c9 - * - 12+ * * - * * C8 - * - 2 * - * * * 0+ * * * - - * * - - * * -12+ * - * - * * - - ------+---------+---------+---------+---------+---------+C9 -4.0 -2.0 0.0 2.0 4.0 6.0 We can obtain the regression plots for the other two independent variables in a similar way. For X1, we first need to regress the outcome variable (Y) on X2 and X3 to obtain the set of residuals MTB > regress c1 2 c3 c4; SUBC> residuals c10. The regression equation is C1 = 164 - 1.11 C3 - 20.2 C4 Predictor Coef Stdev t-ratio p Constant 164.23 34.15 4.81 0.000 C3 -1.111 1.078 -1.03 0.315 C4 -20.23 15.76 -1.28 0.214 s = 13.63 R-sq = 39.5% R-sq(adj) = 33.4% Unusual Observations Obs. C3 C1 Fit Stdev.Fit Residual St.Resid 11 48.0 89.00 62.33 5.26 26.67 2.12R 17 56.0 79.00 51.42 4.62 27.58 2.15R R denotes an obs. with a large st. resid. Next we need to regress X1 onto X2 and X3. MTB > regress c2 2 c3 c4; SUBC> residuals c11. The regression equation is C2 = - 1.1 + 0.368 C3 + 9.60 C4 Predictor Coef Stdev t-ratio p Constant -1.12 19.12 -0.06 0.954 C3 0.3681 0.6034 0.61 0.549 C4 9.599 8.823 1.09 0.290 s = 7.632 R-sq = 26.1% R-sq(adj) = 18.8% We can get the partial regression plot for X1 as a predictor by plotting these two sets of residuals. MTB > plot c10 c11 - * C10 - * - - 16+ * - * - * 2 * - - 0+ * - * * * - * - * * 2 * * * - * * -16+ - * - ------+---------+---------+---------+---------+---------+C11 -10.0 -5.0 0.0 5.0 10.0 15.0 Looks like a reasonable scatterplot with a strong negative slope Similarly, for X3 we first regress Y on X1 and X2. MTB > regress c1 2 c2 c3; SUBC> residuals c12. The regression equation is C1 = 167 - 1.26 C2 - 1.09 C3 Predictor Coef Stdev t-ratio p Constant 166.59 24.91 6.69 0.000 C2 -1.2605 0.2892 -4.36 0.000 C3 -1.0893 0.5514 -1.98 0.062 s = 10.16 R-sq = 66.4% R-sq(adj) = 63.1% Next, we regress X3 onto X1 and X2. MTB > regress c4 2 c2 c3; SUBC> residuals c13. The regression equation is C4 = - 0.431 + 0.00582 C2 + 0.0492 C3 Predictor Coef Stdev t-ratio p Constant -0.4314 0.4608 -0.94 0.360 C2 0.005821 0.005350 1.09 0.290 C3 0.04916 0.01020 4.82 0.000 s = 0.1880 R-sq = 65.2% R-sq(adj) = 61.7% Unusual Observations Obs. C2 C4 Fit Stdev.Fit Residual St.Resid 6 49.0 2.9000 2.5085 0.0600 0.3915 2.20R R denotes an obs. with a large st. resid. MTB > plot c12 c13 - * - * 12+ * * - C12 - * * - * * * - * * ** 0+ * - * * - - * - * -12+ * 2 - * - * - - ------+---------+---------+---------+---------+---------+C13 -0.30 -0.15 0.00 0.15 0.30 0.45 Of the three partial regression plots, only the one for X1 reveals a somewhat strong negative association. The other two reveal weak negative associations. This may suggest that either or both of X2 and X3 may not be very useful in improving prediction of Y. From the full multiple regression note that X1 is the only predictor with a big t-statistic. ----------------------------------------------- 6. part a. Below you will find a summary table of group indicator values for each of the four different treatments. Treatment G1 G2 G3 1 0 0 0 2 1 0 0 3 0 1 0 4 0 0 1 We know that the expected value for any given treatment is simply the population mean for that treatment. Therefore, E(Y|G1=0, G2=0, G3=0) = mu(1) E(Y|G1=1, G2=0, G3=0) = mu(2) E(Y|G1=0, G2=1, G3=0) = mu(3) E(Y|G1=0, G2=0, G3=1) = mu(4) Furthermore, we can represent the expected value for each treatment in terms of general linear model parameters, by plugging the appropriate values for G1, G2, and G3 into the given model. Namely, E(Y|G1,G2,G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3 Treatment 1 ----------- E(Y|G1=0, G2=0, G3=0) = beta0 + beta1*0 + beta2*0 + beta3*0 = beta0 Treatment 2 ----------- E(Y|G1=1, G2=0, G3=0) = beta0 + beta1*1 + beta2*0 + beta3*0 = beta0 + beta1 Treatment 3 ----------- E(Y|G1=0, G2=1, G3=0) = beta0 + beta1*0 + beta2*1 + beta3*0 = beta0 + beta2 Treatment 4 ----------- E(Y|G1=0, G2=0, G3=1) = beta0 + beta1*0 + beta2*0 + beta3*1 = beta0 + beta3 Hence, mu(1) = beta0 mu(2) = beta0 + beta1 mu(3) = beta0 + beta2 mu(4) = beta0 + beta3 Using the information that mu(1) = 7, mu(2) = 9, mu(3) = 6, mu(4) = 15, we conclude: beta0 = mu(1) = 7 beta1 = mu(2) - mu(1) = 9 - 7 = 2 beta2 = mu(3) - mu(1) = 6 - 7 = -1 beta3 = mu(4) - mu(1) = 15 - 7 = 8 --------------------------- part b. mu(3) - mu(2) = beta0 + beta2 - (beta0 + beta1) = beta2 - beta1 = -1 - 2 = -3 Double-checking these answers using our population cell means: mu(3) - mu(2) = 6 - 9 = -3 It checks out. note: HG section 8.22 is useful for understanding the coding of catagorical variables. ------------------------------------------------------------------------- 7. read 'salary.dat' c1-c4 46 ROWS READ ROW C1 C2 C3 C4 1 1 1 1 13876 2 1 3 0 11608 3 1 3 1 18701 4 1 2 0 11283 . . . MTB > name c1 'exp' c2 'ed' c3 'man' c4 'salary' note: to look at categorical variables, MTB table or tally commands will give you relevant counts, relative frequencies etc. Kind of the equivalent of describe command for measured variables. MTB > table c2 ROWS: ed COUNT 1 14 2 19 3 13 ALL 46 MTB > # We need to dummy code the education categories so as to represent three different groups (not levels of an interval variable) make HS the "base" category MTB > code (1 3) 0 (2) 1 c2 c5; > BS=1, Otherwise = 0. MTB > name c5 'ed BS' MTB > table c5 ROWS: ed BS COUNT 0 27 1 19 ALL 46 MTB > code (1 2) 0 (3) 1 c2 c6 MTB > name c6 'ed Ad Deg' >Adv = 1, 0 otherwise MTB > name c6 'ed Adv' To consider the single best predictor of salary, we look at the correlations MTB > corr c1 c3 c4 exp man man -0.051 salary 0.539 0.730 From the above, Corr(salary, management) = 0.730 note that corr(salary, management) is a point biserial correlation (measured variable, salary, correlated with a truly dichotomous variable, managment or not). And for the dummy variable, education, we look at MTB > regress c4 2 c5 c6 The regression equation is salary = 14942 + 3345 ed BS + 3351 ed Adv Predictor Coef Stdev t-ratio Constant 14942 1217 12.27 ed BS 3345 1604 2.09 ed Adv 3351 1754 1.91 s = 4554 R-sq = 10.9% R-sq(adj) = 6.8% Analysis of Variance SOURCE DF SS MS Regression 2 109134646 54567323 Error 43 891962928 20743324 Total 45 1001097576 SOURCE DF SEQ SS ed BS 1 33425874 ed Adv 1 75708771 Unusual Observations Obs. ed BS salary Fit Stdev.Fit Residual St.Resid 42 1.00 27837 18286 1045 9551 2.15R R denotes an obs. with a large st. resid. The correlation here is the sqrt(0.109) = 0.330 Thus it looks like management status is the best predictor of salary, at least among these variables. Predicting salary from all three variables, MTB > regress c4 4 c1 c3 c5 c6 The regression equation is salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv Predictor Coef Stdev t-ratio Constant 8035.6 386.7 20.78 exp 546.18 30.52 17.90 man 6883.5 313.9 21.93 ed BS 3144.0 362.0 8.69 ed Adv 2996.2 411.8 7.28 s = 1027 R-sq = 95.7% R-sq(adj) = 95.3% Analysis of Variance SOURCE DF SS MS Regression 4 957816856 239454214 Error 41 43280720 1055627 Total 45 1001097576 And adding two interaction terms for management by education MTB > let c7=c5*c3 MTB > let c8=c6*c3 MTB > name c7 'BS*man' MTB > name c8 'Adv*man' MTB > regress c4 6 c1 c3 c5 c6 c7 c8 The regression equation is salary = 9473 + 497 exp + 3981 man + 1382 ed BS + 1731 ed Adv + 4903 BS*man + 3066 Adv*man Predictor Coef Stdev t-ratio Constant 9472.69 80.34 117.90 exp 496.987 5.566 89.28 man 3981.4 101.2 39.35 ed BS 1381.67 77.32 17.87 ed Adv 1730.7 105.3 16.43 BS*man 4902.5 131.4 37.32 Adv*man 3066.0 149.3 20.53 s = 173.8 R-sq = 99.9% R-sq(adj) = 99.9% Analysis of Variance SOURCE DF SS MS Regression 6 999919408 166653234 Error 39 1178168 30209 Total 45 1001097576 The model with 4 predictors has R-sq = 0.957 The model with 6 predictors has R-sq = 0.999 The change in R-sq = 0.042 The test statistic to use is F = [delta(R-sq)/# variables added]/[(1-R-sq(full))/(N- #variables in full-1)] = (0.042/2) / (1 - 0.999)/(46 - 6 -1) = 819 The critical value is F(0.95, 2,39) = 3.24 So we conclude that the interaction terms add to the prediction of salary significantly. The regression of salary on experience, management and education is MTB > regress c4 4 c1 c3 c5 c6 The regression equation is salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv Predictor Coef Stdev t-ratio Constant 8035.6 386.7 20.78 exp 546.18 30.52 17.90 man 6883.5 313.9 21.93 ed BS 3144.0 362.0 8.69 ed Adv 2996.2 411.8 7.28 s = 1027 R-sq = 95.7% R-sq(adj) = 95.3% Analysis of Variance SOURCE DF SS MS Regression 4 957816856 239454214 Error 41 43280720 1055627 Total 45 1001097576 We answer the last two parts using the model without those interaction terms which is basically an arbitrary choice of convenience. Would also be reasonable to use the full models and calculate this increment separately for managers and nonmanagers etc. From the regression coefficient, we see that one more year of experience is worth, on average, $546.18 A 95% confidence interval is thus given by 546.18 +/- t(0.975, 41)*30.52 ===> (484.50, 607.86) The "value" of an advanced degree is $2996.2, so a 95% confidence interval for this estimate is 2996.2 +/- t(0.025, 41)*411.8 ===> (2163.95, 3828.46) Note; Value is calibrated according to the coding of the educational groups. The phrasing of the final question "Repeat for an advanced degree in addition to the BS" can be subject easily to a different reading than I had intended. The comparison between advanced and H.S, is what is done above (and what I had in mind). note: HG section 8.22 is useful for understanding the coding of categorical predictor variables. HG p. 187 gives an example of using a regression equation to predict changes on a dependent variable for various configurations of predictors. ------------------------------------------------------- 8. The prediction equation to work from is S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X The coefficient of T in the regression equation is 50. It tells us that the fit will show there will be a 50 unit decrease in salary for someone whose student evaluations change from very good (T=1) to poor (T=0). This is true no matter what the values of the other variables are (of course this conclusion applies only to this model, with its set of predictors). partial regression coefficient of T here could be interpreted as the average change in salary between the two levels of student evaluation, which is 50. -------------------- To do this problem the HARD way, you can plug in the values of the other variables that the question provides for us (B=A=E=D=X=1 and Y=5) into the regression equation: S = 1900 + 230(1) + 18(1) + 100(1) + 490(1) + 190(5) + 50(T) - 2400(1) = 1288 + 50(T) For a professor with good student evaluations (T = 1): S = 1288 + 50(1) = 1338 For a professor with bad student evaluations (T = 0): S = 1288 So the expected change in salary is 1288-1338 = -50. (That is, a decrease of 50). ------------------------------------ model for the S on X regression is E(S|X) = beta(0) + beta(1)*X To calculate the value of the estimate of the slope (beta(hat)(1)), first look at the expected values of X within each group and realize that they are equal to the means of S within each group. sample value of E(S|X=0) = beta(hat)(0) = mu(hat)(0) E(S|X=1) = beta(hat)(0) + beta(hat)(1) = mu(hat)(1) Where: mu(hat)(0) = the mean salary for males = 16,100 mu(hat)(1) = the mean salary for females = 11,200 To calculate the slope, beta(hat)(1), we subtract the two group means: beta(hat)(1) = mu(hat)(1) - mu(hat)(0) = 11,200 - 16,100 = -4900 Another way to do this problem is to realize that the line for the regression of S on X will pass through the points (0, 16100) and (1, 11200). The slope = (change in S)/(change in X) = (16100 -11200)/(0 - 1) = -4900. ----------------------------------------------------------------- 9. To calculate the slope for the C1 on C2 regression for the incentive group, we examine the general model: E(C1|C2,C3) = beta(0) + beta(1)*C3 + beta(2)*C2 + beta(3)*(C2*C3) We have the results of this regression in the output for the regression of C1 on C2, C3, and C4 (i.e., C2*C3). >From substituting 0 and 1 for C3, we know that: E(C1|C2, C3=0) = beta(0) + beta(2)*C2 E(C1|C2, C3=1) = [beta(0) + beta(1)] + [beta(2) + beta(3)]*C2 Thus the sample value of the slope for the incentive group (C3=1) is beta(hat)(2) + beta(hat)(3). slope = .267 + (-.06) = .207 To calculate the correlation between C1 and C2 for the incentive group, we use the equation: slope(C1 on C2) = r(C1,C2)*(s(C1)/s(C2)) You've seen this equation written as: b(yx) = r(xy)*(s(y)/s(x)). The question tells us that the standard deviations of C1 and C2 for the incentive group are 1.782 and 5.84, respectively. Solving for r, we get: .207 = r(C1,C2)*(1.782/5.84) or r(C1,C2) = .207*(5.84/1.782) = .678 Treatment Effect Confidence Interval There are a couple of ways to do this problem. (i) The treatment effect is represented by the coefficient of C3 (the group membership variable) in the regression of C1 on C3 and C2. However, it is blocked out in the output and so all we know is the standard error of the treatment effect (.4926). However, we also know that the treatment effect is the difference between the adjusted means in the ancova output. This is 9.9167 - 5.8333 = 4.0834. A 99% confidence interval is (4.0834) +/- (.4926)(t(.995, df = #obs-#parameters) we can look up t(.995, 21) = 2.831. The 99% confidence interval has endpoints 4.0834 +/- 1.395 or (2.689, 5.478). (ii) Another way to calculate the value of the treatment effect is to notice that the F-value for C3 in the ancova output is the square of the t-ratio for C3 in the regression of C1 on C3 and C2. F = MS(C3)/MS(Error) = 100.042/1.456 = 68.71 t = (68.71)**.5 = 8.29 Since we know the standard error of the coefficient of C3 in the regression of C1 on C3 and C2 and its t-ratio, we can solve for the coefficient: coefficient of C3 = treatment effect = 8.29*.4926 = 4.0836 Now we can calculate the confidence interval as we had done above. (iii) use formula for adjusted mean difference: Ybar(1) - Ybar(0) - betahatpooled( Xbar(1) - Xbar(0)) betahatpooled, pooled slope from ancova routine printout is .2367. Ybar() Xbar(), within-group means on outcome and covariate from describe. Substitute values. note: HG section 21.3 elaborates the ANCOVA model. Section 21.8 explains the calculation of adjusted means and the pooled-within slope Bw.