Education 161 Winter 2000 Assignment 4 Solutions March 7, 2000 Problem 1. read data into C1; output follows 240 ROWS READ C1 1 1 0 1 . . . MTB > table c1 ROWS: C1 COUNT 0 61 1 179 ALL 240 The estimated Pr(Success) = 179/240 = 0.7458 and this has standard error sqrt(p(1-p)/n) = 0.0281 Thus , using the normal approximation, since n=240 and p=0.7458, a 95% confidence interval for pi is given by 0.7458 +/- 1.96* 0.0281 ===> (0.6907, 0.8009). Subcommands of tally would give proportions etc (You could also use describe to get the mean of C1, empirical proportion) you may want to compare with HG sec13.8, Fig 13.5 for exact interval ------------------------------------------------------------------------ Problem 2 Take the 2x2 table and put the counts in two cols MTB > chisquare c1 c2 Expected counts are printed below observed counts VOTE NOVOTE 1 1481 132 1613 Some HS 1438.7 174.3 2 1036 173 1209 No HS 1078.3 130.7 Total 2517 305 2822 ChiSq = 1.25 + 10.28 + 1.66 + 13.71 = 26.90 df = 1 The critical chi-sq(0.95,1) = 3.84. Thus the null hypothesis of no association is rejected. The phi coefficient is one measure of association, given as sqrt(chi-sq/n) = sqrt(26.90/2822) = 0.098. If you really want to work the phi coeff can alternatively be computed by phi = [n(1,1)*n(2,2) - n(2,1)*n(1,2)]/{[n(1+)*n(2+)*n(+1)*n(+2)]**.5} nore: HG sections 7.22 and 13.15 show the calculation of the phi coefficient from contingency tables and the chi-squared statistic respectively. -------------------------------------------------------------------- Problem 3 The completed table is following, including percentages and expected number. 1961 1966 1971 1976 1981 Total ------------------------------------------------------------------------ CONTRI: orignial 196 266 194 276 333 1265 percentage 61.44 54.07 44.60 46.15 36.96 46.08 expected 147.0 226.7 200.5 275.6 415.2 NOT-CON:orignal 123 226 241 322 568 1480 percentage 38.56 45.93 55.40 53.85 63.04 53.92 expected 172.0 265.3 234.5 322.4 485.8 Total 319 492 435 598 901 2745 For example in the class of 1966, 54.07% of alumni contacted were contributors. H0: pi1(1)=...=pi1(5). H1: at least one of the above equalities does not hold. (i.e. not all proportions equal) Expected Counts given above Components of test statistic (from Minitab output) ChiSq = 16.33 + 6.80 + 0.21 + 0.00 + 16.28 + 13.96 + 5.81 + 0.18 + 0.00 + 13.91 = 73.48 df = 4 clearly, reject H0 at level .01. (critical value 13.3) So the proportion varies with the year of graduation. From empirical proportions it appears graduates from earlier years tend to participate more than the recent years. nore: HG section 13.11 works through the smoking-CHD example of the chi-squared test of association. ------------------------------------------------------- Problem 4 MTB > read 'rxc.dat' c1 c2 120 ROWS READ ROW C1 C2 1 0 2 2 3 3 3 1 1 4 1 2 . . . MTB > table c1 c2; SUBC> counts; SUBC> rowpercent; SUBC> colpercent. ROWS: C1 COLUMNS: C2 0 1 2 3 4 ALL 0 1 9 8 9 0 27 3.70 33.33 29.63 33.33 -- 100.00 50.00 39.13 19.51 21.43 -- 22.50 1 1 11 14 16 4 46 2.17 23.91 30.43 34.78 8.70 100.00 50.00 47.83 34.15 38.10 33.33 38.33 2 0 3 15 10 7 35 -- 8.57 42.86 28.57 20.00 100.00 -- 13.04 36.59 23.81 58.33 29.17 0 1 2 3 4 ALL 3 0 0 4 7 1 12 -- -- 33.33 58.33 8.33 100.00 -- -- 9.76 16.67 8.33 10.00 ALL 2 23 41 42 12 120 1.67 19.17 34.17 35.00 10.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 CELL CONTENTS -- COUNT % OF ROW % OF COL **here's output from a chisquare subcommand********************* ROWS: C1 COLUMNS: C2 0 1 2 3 4 ALL 0 1 9 8 9 0 27 0.45 5.18 9.23 9.45 2.70 27.00 1 1 11 14 16 4 46 0.77 8.82 15.72 16.10 4.60 46.00 2 0 3 15 10 7 35 0.58 6.71 11.96 12.25 3.50 35.00 3 0 0 4 7 1 12 0.20 2.30 4.10 4.20 1.20 12.00 0 1 2 3 4 ALL ALL 2 23 41 42 12 120 2.00 23.00 41.00 42.00 12.00 120.00 CHI-SQUARE = 18.984 WITH D.F. = 12 CELL CONTENTS -- COUNT EXP FREQ The critical value of the chi-sq (0.95, 12) = 21.0 Thus the hypothesis of independence cannot be rejected. note minitab sec 11.4 cautions against using chi-square test with many small expected counts. for this table it would be best to collapse/combine some categories. One additional comment : if you do combine columns 1 and 2 (because of low cell counts), the result of the test changes. Observed chi-square is 18.778 with 9 d.f.; critical value is 16.9190 ==> reject Ho. Note the cross-tabulation menu item is useful if you prefer not to type the Minitab commands. --------------------------------------------------------------- 5. LOGISTIC REGRESSION: BINARY OUTCOMES 1. read data file into C1-C2 30 ROWS READ ROW C1 C2 1 0 30 2 1 30 3 0 30 4 0 31 . . . MTB > name c1='norenewl' name c2='duesincr' MTB > print c1 c2 ROW norenewl duesincr 1 0 30 2 1 30 3 0 30 4 0 31 5 0 32 6 0 33 7 1 34 8 0 35 9 0 35 10 1 35 11 1 36 12 0 37 13 0 38 14 1 39 15 0 40 16 1 40 17 1 40 18 0 41 19 1 42 20 1 43 21 1 44 22 0 45 23 1 45 24 1 45 25 0 46 26 1 47 27 1 48 28 0 49 29 1 50 30 1 50 MTB > describe c1 c2 N MEAN MEDIAN TRMEAN STDEV SEMEAN norenewl 30 0.5333 1.0000 0.5385 0.5074 0.0926 duesincr 30 39.67 40.00 39.62 6.34 1.16 MIN MAX Q1 Q3 norenewl 0.0000 1.0000 0.0000 1.0000 duesincr 30.00 50.00 34.75 45.00 MTB > plot c1 c2 - 1.05+ - * * * * * 2 * * * 2 * * 2 norenewl- - - 0.70+ - - - - 0.35+ - - - - 0.00+ 2 * * * 2 * * * * * * * --------+---------+---------+---------+---------+--------duesincr 32.0 36.0 40.0 44.0 48.0 query: what's the c1 c2 correlation? PART A. To do Weighted Least Squares (WLS) regression we need to first do Ordinary Least Squares (OLS) regression. The OLS regression gives us the fits (Yhat) that we then use to construct the weights for WLS. MTB > brief 2 MTB > regress c1 1 c2 c10 c11; SUBC>residuals c12. The regression equation is norenewl = - 0.600 + 0.0286 duesincr Predictor Coef Stdev t-ratio p Constant -0.6000 0.5670 -1.06 0.299 duesincr 0.02857 0.01412 2.02 0.053 s = 0.4823 R-sq = 12.8% R-sq(adj) = 9.6% Analysis of Variance SOURCE DF SS MS F p Regression 1 0.9524 0.9524 4.09 0.053 Error 28 6.5143 0.2327 Total 29 7.4667 MTB > name c10='olsstres' c11='olsfit' c12='olsresid' Now we do WLS regression: MTB > name c15='weights' MTB > let c15= 1 / (c11 * (1 - c11)) MTB > print c2 c1 c11 c15 ROW duesincr norenewl olsfit weights 1 30 0 0.257143 5.23504 2 30 1 0.257143 5.23504 3 30 0 0.257143 5.23504 4 31 0 0.285714 4.90000 5 32 0 0.314286 4.64015 6 33 0 0.342857 4.43841 7 34 1 0.371429 4.28322 8 35 0 0.400000 4.16667 9 35 0 0.400000 4.16667 10 35 1 0.400000 4.16667 11 36 1 0.428571 4.08333 12 37 0 0.457143 4.02960 13 38 0 0.485714 4.00327 14 39 1 0.514286 4.00327 15 40 0 0.542857 4.02961 16 40 1 0.542857 4.02961 17 40 1 0.542857 4.02961 18 41 0 0.571429 4.08333 19 42 1 0.600000 4.16667 20 43 1 0.628571 4.28322 21 44 1 0.657143 4.43841 22 45 0 0.685714 4.64015 23 45 1 0.685714 4.64015 24 45 1 0.685714 4.64015 25 46 0 0.714286 4.90000 26 47 1 0.742857 5.23504 27 48 1 0.771429 5.67130 28 49 0 0.800000 6.25000 29 50 1 0.828571 7.04023 30 50 1 0.828571 7.04023 MTB > regress c1 1 c2 c20 c21; SUBC>weights c15; SUBC>residuals c22. **note here's the WLS fit******** The regression equation is norenewl = - 0.589 + 0.0282 duesincr Predictor Coef Stdev t-ratio p Constant -0.5892 0.5345 -1.10 0.280 duesincr 0.02820 0.01313 2.15 0.041 **note slight decrease in s.e. of parameters estimates from OLS******** Analysis of Variance SOURCE DF SS MS F p Regression 1 4.974 4.974 4.61 0.041 Error 28 30.187 1.078 Total 29 35.160 Unusual Observations Obs.duesincr norenewl Fit Stdev.Fit Residual St.Resid 28 49.0 0.000 0.792 0.145 -0.792 -2.04R R denotes an obs. with a large st. resid. MTB > name c20='wlsstres' c21='wlsfit' c22='wlsresid' PART B. What is the largest residual from the WLS fit? MTB > print c2 c1 c11 c12 c21 c22 ROW duesincr norenewl olsfit olsresid wlsfit wlsresid 1 30 0 0.257143 -0.257143 0.256704 -0.256704 2 30 1 0.257143 0.742857 0.256704 0.743296 3 30 0 0.257143 -0.257143 0.256704 -0.256704 4 31 0 0.285714 -0.285714 0.284902 -0.284902 5 32 0 0.314286 -0.314286 0.313101 -0.313101 6 33 0 0.342857 -0.342857 0.341299 -0.341299 7 34 1 0.371429 0.628571 0.369497 0.630503 8 35 0 0.400000 -0.400000 0.397696 -0.397696 9 35 0 0.400000 -0.400000 0.397696 -0.397696 10 35 1 0.400000 0.600000 0.397696 0.602304 11 36 1 0.428571 0.571429 0.425894 0.574106 12 37 0 0.457143 -0.457143 0.454092 -0.454092 13 38 0 0.485714 -0.485714 0.482291 -0.482291 14 39 1 0.514286 0.485714 0.510489 0.489511 15 40 0 0.542857 -0.542857 0.538687 -0.538687 16 40 1 0.542857 0.457143 0.538687 0.461313 17 40 1 0.542857 0.457143 0.538687 0.461313 18 41 0 0.571429 -0.571429 0.566886 -0.566886 19 42 1 0.600000 0.400000 0.595084 0.404916 20 43 1 0.628571 0.371429 0.623282 0.376718 21 44 1 0.657143 0.342857 0.651481 0.348519 22 45 0 0.685714 -0.685714 0.679679 -0.679679 23 45 1 0.685714 0.314286 0.679679 0.320321 24 45 1 0.685714 0.314286 0.679679 0.320321 25 46 0 0.714286 -0.714286 0.707877 -0.707877 26 47 1 0.742857 0.257143 0.736076 0.263924 27 48 1 0.771429 0.228571 0.764274 0.235726 28 49 0 0.800000 -0.800000 0.792472 -0.792472 29 50 1 0.828571 0.171429 0.820671 0.179329 30 50 1 0.828571 0.171429 0.820671 0.179329 This shows that the largest residuals for both OLS and WLS fits to the data belong to subject 28, who renewed membership even though his/her dues were increased by $49. The WLS residual is 0 - 0.792472 = -.792472 . Either scan by eye the resid cols or use MAX or MIN commands (or DESCRIBE) on residuals stored in a column. PART C. Because E(Y|X) is Pr(Y=1|X), if dues are increased by $40 (i.e. X=40), the probability that members will not renew (i.e. Y=1) is given by the regression fit: Y = - 0.5892 + 0.0282 (40) = .539 (note this is the WLS fitted value for X=40 in the output above) PART D. At what level of dues increase will 75% of the members not renew A rough answer can be obtained from the plot of the fit. MTB > plot c21 c2 - 2 0.80+ * - * wlsfit - * * - 3 - * * 0.60+ * - * - * 3 - * - * * 0.40+ 3 - * * - * - * - 3 0.20+ --------+---------+---------+---------+---------+--------duesincr 32.0 36.0 40.0 44.0 48.0 more accurately, solve .75 = - 0.5892 + 0.0282 (X); to give X = 47.49 or with a little more precision solving for X in the log odds linear form: ln(.75/.25)= -4.80751+.125078(X) ==> X=47.2195. From the output above and this plot we see that the probability that members won't renew is .75 when the increase in dues is between $47 and $48, so we can expect 75% of the members not to renew when the increase is in this range. PART E. Use logistic fit to repeat parts b,c,d Minitab logistic fit using blog (see help blog in Minitab) note Mintab and SAS give same parameter estimates etc MTB > blog c1 = c2 Binary Logistic Regression Link Function: Logit Response Information Variable Value Count C1 1 16 (Event) 0 14 Total 30 Logistic Regression Table Odds 95% CI Predictor Coef StDev Z P Ratio Lower Upper Constant -4.808 2.656 -1.81 0.070 C2 0.12508 0.06676 1.87 0.061 1.13 0.99 1.29 Log-Likelihood = -18.732 Test that all slopes are zero: G = 3.991, DF = 1, P-Value = 0.046 ===================================== For reference here is SAS output and batch file for this problem, following the form of progsas.* examples from lecture these were done on leland SAS logistic 1 17:06 Sunday, March 12, 1995 The LOGISTIC Procedure Data Set: WORK.RENEWAL Response Variable: RENEW Response Levels: 2 Number of Observations: 30 Link Function: Logit Response Profile Ordered Value RENEW Count 1 1 16 2 0 14 Criteria for Assessing Model Fit Intercept Intercept and Criterion Only Covariates Chi-Square for Covariates AIC 43.455 41.465 . SC 44.857 44.267 . -2 LOG L 41.455 37.465 3.991 with 1 DF (p=0.0458) Score . . 3.827 with 1 DF (p=0.0504) Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT 1 -4.8075 2.6558 3.2769 0.0703 . 0.008 DUES 1 0.1251 0.0668 3.5104 0.0610 0.437388 1.133 Association of Predicted Probabilities and Observed Responses Concordant = 68.3% Somers' D = 0.402 Discordant = 28.1% Gamma = 0.417 Tied = 3.6% Tau-a = 0.207 (224 pairs) c = 0.701 SAS logistic 2 17:06 Sunday, March 12, 1995 OBS RENEW DUES _LEVEL_ PHAT LOW UP 1 0 30 1 0.25823 0.07591 0.59602 2 1 30 1 0.25823 0.07591 0.59602 3 0 30 1 0.25823 0.07591 0.59602 4 0 31 1 0.28291 0.09405 0.59989 5 0 32 1 0.30895 0.11554 0.60477 6 0 33 1 0.33628 0.14051 0.61092 7 1 34 1 0.36474 0.16886 0.61869 8 0 35 1 0.39418 0.20014 0.62852 9 0 35 1 0.39418 0.20014 0.62852 10 1 35 1 0.39418 0.20014 0.62852 11 1 36 1 0.42441 0.23348 0.64092 12 0 37 1 0.45522 0.26762 0.65645 13 0 38 1 0.48637 0.30101 0.67556 14 1 39 1 0.51763 0.33207 0.69844 15 0 40 1 0.54875 0.35959 0.72480 16 1 40 1 0.54875 0.35959 0.72480 17 1 40 1 0.54875 0.35959 0.72480 18 0 41 1 0.57949 0.38291 0.75373 19 1 42 1 0.60963 0.40201 0.78392 20 1 43 1 0.63896 0.41732 0.81389 21 1 44 1 0.66728 0.42943 0.84238 22 0 45 1 0.69445 0.43897 0.86845 23 1 45 1 0.69445 0.43897 0.86845 24 1 45 1 0.69445 0.43897 0.86845 25 0 46 1 0.72033 0.44649 0.89158 26 1 47 1 0.74482 0.45243 0.91159 27 1 48 1 0.76786 0.45712 0.92854 28 0 49 1 0.78940 0.46084 0.94265 29 1 50 1 0.80944 0.46378 0.95426 30 1 50 1 0.80944 0.46378 0.95426 SAS command file options linesize = 80 pagesize=80; data renewal; infile 'dues.dat'; input renew dues ; run; proc logistic descending data=renewal; model renew = dues ; title 'SAS logistic '; output out=predict p=phat l=low u=up; run; proc print data=predict; run; ================================================================== (* use coefficients from PROC LOGIST or cut-and paste from SAS output to get fit values array below *) MTB > let c25=expo(-4.80751 + .125078*c2) MTB > let c31=c25 / (1 + c25) MTB > let c32=c1-c31 MTB > name c31='logstfit' MTB > name c32='logstresid' MTB > print c2 c1 c11 c21 c31 ROW duesincr norenewl olsfit wlsfit logstfit 1 30 0 0.257143 0.256704 0.258234 2 30 1 0.257143 0.256704 0.258234 3 30 0 0.257143 0.256704 0.258234 4 31 0 0.285714 0.284902 0.282906 5 32 0 0.314286 0.313101 0.308954 6 33 0 0.342857 0.341299 0.336276 7 34 1 0.371429 0.369497 0.364738 8 35 0 0.400000 0.397696 0.394179 9 35 0 0.400000 0.397696 0.394179 10 35 1 0.400000 0.397696 0.394179 11 36 1 0.428571 0.425894 0.424408 12 37 0 0.457143 0.454092 0.455214 13 38 0 0.485714 0.482291 0.486367 14 39 1 0.514286 0.510489 0.517626 15 40 0 0.542857 0.538687 0.548747 16 40 1 0.542857 0.538687 0.548747 17 40 1 0.542857 0.538687 0.548747 18 41 0 0.571429 0.566886 0.579492 19 42 1 0.600000 0.595084 0.609632 20 43 1 0.628571 0.623282 0.638958 21 44 1 0.657143 0.651481 0.667283 22 45 0 0.685714 0.679679 0.694448 23 45 1 0.685714 0.679679 0.694448 24 45 1 0.685714 0.679679 0.694448 25 46 0 0.714286 0.707877 0.720326 26 47 1 0.742857 0.736076 0.744817 27 48 1 0.771429 0.764274 0.767854 28 49 0 0.800000 0.792472 0.789400 29 50 1 0.828571 0.820671 0.809442 30 50 1 0.828571 0.820671 0.809442 So logistic fit not terribly deviant from fit to straight line response function. See also plots of fits below (logistic shows slight curvature even at this poor graphical resolution) MTB > plot c11 c2 - 2 0.80+ * - * * olsfit - * - 3 - * * 0.60+ * - 3 * - * - * - * * 0.40+ 3 - * * - * - * - 3 0.20+ --------+---------+---------+---------+---------+--------duesincr 32.0 36.0 40.0 44.0 48.0 MTB > plot c21 c2 - 2 0.80+ * - * wlsfit - * * - 3 - * * 0.60+ * - * - * 3 - * - * * 0.40+ 3 - * * - * - * - 3 0.20+ --------+---------+---------+---------+---------+--------duesincr 32.0 36.0 40.0 44.0 48.0 MTB > plot c31 c2 - 0.80+ * 2 - * * logstfit - * - * 3 - * 0.60+ * - 3 * - * - * - * * 0.40+ 3 - * - * * - * - 3 0.20+ --------+---------+---------+---------+---------+--------duesincr 32.0 36.0 40.0 44.0 48.0 Let's look at residuals which we have stored up MTB > print c2 c1 c12 c22 c32 ROW duesincr norenewl olsresid wlsresid logstresid 1 30 0 -0.257143 -0.256704 -0.258234 2 30 1 0.742857 0.743296 0.741766 3 30 0 -0.257143 -0.256704 -0.258234 4 31 0 -0.285714 -0.284902 -0.282906 5 32 0 -0.314286 -0.313101 -0.308954 6 33 0 -0.342857 -0.341299 -0.336276 7 34 1 0.628571 0.630503 0.635262 8 35 0 -0.400000 -0.397696 -0.394179 9 35 0 -0.400000 -0.397696 -0.394179 10 35 1 0.600000 0.602304 0.605821 11 36 1 0.571429 0.574106 0.575592 12 37 0 -0.457143 -0.454092 -0.455214 13 38 0 -0.485714 -0.482291 -0.486367 14 39 1 0.485714 0.489511 0.482374 15 40 0 -0.542857 -0.538687 -0.548747 16 40 1 0.457143 0.461313 0.451253 17 40 1 0.457143 0.461313 0.451253 18 41 0 -0.571429 -0.566886 -0.579492 19 42 1 0.400000 0.404916 0.390368 20 43 1 0.371429 0.376718 0.361042 21 44 1 0.342857 0.348519 0.332717 22 45 0 -0.685714 -0.679679 -0.694448 23 45 1 0.314286 0.320321 0.305552 24 45 1 0.314286 0.320321 0.305552 25 46 0 -0.714286 -0.707877 -0.720326 26 47 1 0.257143 0.263924 0.255183 27 48 1 0.228571 0.235726 0.232146 28 49 0 -0.800000 -0.792472 -0.789400 29 50 1 0.171429 0.179329 0.190558 30 50 1 0.171429 0.179329 0.190558 --What is the largest residual from the logistic fit? The output above shows that again it is the residual for subject 28: 0 - 0.789400 = -0.789400, slightly smaller than the WLS residual for this subject. Doing a DESCRIBE c32 would show the value. --If dues are increased by $40 the probability that members will not renew (i.e. Y=1) is given, for the logistic model, by: pi(hat) = expo(-4.80751 + .125078*X) / (1 + expo(-4.80751 + .125078*X)) = expo(.19561) / (1 + expo(.19561)) = 1.216 / 2.216 = .5487 (note this is the logistic fitted value for X=40 in the output above; this recomputation is just for exposition) --With the logistic fit, at what level of dues increase will 75% of the members not renew membership (i.e. will Y=1)? From the listing of the fits for the logistic function in c31 above we see (again) that the probability that members won't renew is .75 when the increase is between $47 and $48, so we can expect 75% of the members not to renew when the increase is in this range. The actual value for a .75 probability of nonrenewal is about $47.22 which can be obtained most simply from numerical trial and error (I did it graphically from a good plot of the logistic function). PART F. What's the probability of nonrenewal (i.e. Y=1) when there is no dues increase (i.e. X=0) under the WLS and log models? When X=0 in the WLS model, Y = - 0.589 + 0.0282 (0) = -.589 which is meaningless since probabilities can not be less than zero or more than one -- dues increase of zero is too far out of the range of the data for the linear WLS model to apply. In the logistic model, when X=0: pie(hat) = = expo(-4.80751 + .125078*0) / (1 + expo(-4.80751 + .125078*0)) = expo(-4.80751) / (1 + expo(-4.80751)) = .0081681 / 1.0081681 = .0081019 So the logistic model tells us the probability of no increase in dues leading to nonrenewal is very small. So setting the negative value given by the straight- line functional form to 0 is not a bad approximation. Note: The footnote on HG page 434 provides some basic information on logarithms. HG section 8.29 covers the basics on logistic regression. The Minitab help files for blog etc are another useful resource on this topic. (* =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= *) END