SOLUTIONS    Start-up Review Problems -- Ed257    D Rogosa    January 1999

1. (Reference NWK Sec 16.8 pp. 681-5)

The ANOVA table is as follows, with calculations below.

SOURCE       SS     df     MS
Between      80      4     20
Within      400     40     10
Total       480     44

SSW = SST - SSB = 480 - 80 = 400
df(within) = total n - (# of groups) = (44 + 1) - (4 + 1) = 40
             or df(total) - df(between) = 44 - 4 = 40
MS = SS/df so that MSB = 80/4 = 20 and MSW = 400/40 = 10.

The omnibus null hypothesis is
     Ho: mu(1) = mu(2) = ... = mu(5)
i.e. that all 5 population means are equal, versus the alternative
hypothesis that not all are equal.

The test statistic is the ratio of the mean squares: MSB/MSW = 20/10 = 2.
The critical value for Type I error rate .10 is F(0.90, 4, 40) = 2.09
(Table A.4, rough interpolation). Or use Minitab:

MTB > invcdf .90;
SUBC> f 4 40.
    0.9000    2.0909

Since 2 < 2.09, we do not reject the null hypothesis.

NOTE: since subscripts cannot be displayed in this text mode we will
usually employ parens to indicate subscripts etc -- e.g. mu(1).
-----------------------------------------------------------------------------
2.
a) The model for this problem is as follows:

     Y(ij) = mu + alpha(i) + epsilon(ij)

   where i = 1,2,3 (3 groups)
         j = 1,2,...,n(i) where n(1)=12, n(2)=14, n(3)=11
         Y(ij) = jth employee's response in the ith group
         mu = overall mean
         alpha(i) = effect of the ith group
         epsilon(ij) = random error (individual differences) associated
                       with the jth employee in the ith group

   Refer to NWK formula (16.62) on p. 693.

   (An alternative model in terms of the cell means rather than main
   effects could be written:

     Y(ij) = mu(i) + epsilon(ij)

   where i = 1,2,3
         j = 1,2,...,n(i)
         Y(ij) = jth employee's response in the ith group
         mu(i) = mean of the ith group
         epsilon(ij) = random error associated with the jth employee
                       in the ith group )

b) We are given

               G(1)    G(2)    G(3)
   n(i)         12      14      11
   y(i)bar     25.2    32.6    28.1    (sample means)
   s(i)^2       3.6     4.8     5.3    (sample variances)

   n = 12+14+11 = 37

First calculate the grand mean: weight each group mean by its sample
size, add, and divide by total n:

   ybar = [25.2(12) + 32.6(14) + 28.1(11)]/(12+14+11) = 1067.9/37 = 28.86

Degrees of freedom between is 3 - 1 = 2, and within is 37 - 3 = 34.

Form SSB by deviating the group means from the grand mean (28.86),
squaring the deviations, multiplying by the group size, and summing
over the three groups:

   SSB = 12(25.2-28.86)^2 + 14(32.6-28.86)^2 + 11(28.1-28.86)^2
       = 160.75 + 195.83 + 6.35 = 362.93

MSB is 362.93/2 = 181.46.

Now,
   SSW = (n(1)-1)s(1)^2 + (n(2)-1)s(2)^2 + (n(3)-1)s(3)^2
       = 11(3.6) + 13(4.8) + 10(5.3) = 155

MSW is the weighted average of the within-group variances, with
weights n(i)-1, which is found by dividing SSW by dfw:
155/34 = 4.559.

Hence the ANOVA table is

SOURCE        SS      df      MS
Between     362.93     2    181.46
Within      155       34      4.559
Total       517.93    36

Test statistic = MSB/MSW = 181.46/4.559 = 39.80

The 99th percentile point of F(2,34) is 5.30 (by simple interpolation,
since F(0.99,2,30)=5.39 and F(0.99,2,40)=5.18).
Since 39.80 > 5.30 we reject the null hypothesis of equal means in all
groups.
----------------------------------------------------------------------------
3.
a)
MTB > read '/usr/class/ed257/HW/knee.dat' c1 c2
     24 ROWS READ

 ROW    C1    C2
   1    29     1
   2    42     1
   3    38     1
   4    40     1
  . . .

MTB > describe c1;
SUBC> by c2.
      C2      N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
C1     1      8    38.00    40.00    38.00     5.48     1.94
       2     10    32.00    31.00    31.62     3.46     1.10
       3      6    24.00    22.50    24.00     4.43     1.81

      C2      MIN      MAX       Q1       Q3
C1     1    29.00    43.00    32.00    42.00
       2    28.00    39.00    29.00    35.00
       3    20.00    32.00    20.75    27.50

The group means are 38, 32, and 24 for the below average, average, and
above average groups, respectively. Variances (the squared standard
deviations) are about 30.0, 12.0, and 19.6.
(Note: The group means and SDs are also displayed under the ANOVA table.)

b)
MTB > dotplot c1;
SUBC> by c2.

[Character dotplots of C1 for each level of C2 (1 = below average,
2 = average, 3 = above average), drawn on a common C1 axis running
from 20.0 to 45.0]

These plots illustrate the clustering of the observations in each group
about the group means. The small sample sizes make it difficult to
detect outliers or heteroskedasticity (unequal group variances),
although the observations in the below average group appear to be
somewhat more spread out than are those in the other groups.

c)
MTB > oneway c1 c2 resids in c3 fits in c4;
SUBC> tukey.

(Note: the above command tells Minitab to store the residuals in C3 and
the fitted values (which are just the group means) in C4. The words
"resids in" and "fits in" are unnecessary; could just write
MTB > oneway c1 c2 c3 c4)

ANALYSIS OF VARIANCE ON C1
SOURCE     DF        SS        MS        F        p
C2          2     672.0     336.0    16.96    0.000
ERROR      21     416.0      19.8
TOTAL      23    1088.0

INDIVIDUAL 95 PCT CI'S FOR MEAN BASED ON POOLED STDEV
LEVEL      N      MEAN     STDEV
    1      8    38.000     5.477
    2     10    32.000     3.464
    3      6    24.000     4.427

POOLED STDEV = 4.451

[Character plot of the individual 95 pct CI's for the three group
means; axis marked at 24.0, 30.0, 36.0]

The omnibus null hypothesis is
     Ho: mu(1) = mu(2) = mu(3)
We test this against the alternative
     Ha: not all mu(i) are equal

Test statistic is MSB/MSW = 336.0/19.81 = 16.96.
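
The sums of squares in the table above can be rebuilt by hand from the group sizes, means, and standard deviations alone. The following is a minimal Python sketch of that arithmetic, not part of the Minitab session; the function name oneway_from_summary and the layout of its return value are assumptions made for illustration.

```python
# Sketch: one-way ANOVA quantities from per-group summary statistics.
# The function name and the returned dictionary keys are illustrative
# assumptions, not Minitab output.

def oneway_from_summary(ns, means, sds):
    """Return ANOVA quantities given group sizes, means, and SDs."""
    n_total = sum(ns)
    k = len(ns)
    # Grand mean: group means weighted by group size.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    # Between-groups SS: size-weighted squared deviations of group means.
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Within-groups SS: (n_i - 1) times each group variance.
    ssw = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between = k - 1
    df_within = n_total - k
    msb = ssb / df_between
    msw = ssw / df_within
    return {
        "grand_mean": grand_mean,
        "ssb": ssb,
        "ssw": ssw,
        "df_between": df_between,
        "df_within": df_within,
        "msb": msb,
        "msw": msw,
        "f": msb / msw,
    }


# Group summaries taken from the oneway output for the knee data.
result = oneway_from_summary([8, 10, 6],
                             [38.0, 32.0, 24.0],
                             [5.477, 3.464, 4.427])
for key, value in result.items():
    print(f"{key:>10}: {value:.3f}")
```

Because the SDs are rounded to three decimals, SSW and F agree with the Minitab table only up to rounding. The same function can be applied to the Problem 2 summaries by passing the square roots of the given variances as the SDs.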
Find critical value F(.95,2,21):

MTB > invcdf .95;
SUBC> f 2 21.
    0.9500    3.4668

Since 16.96 > 3.4668, we reject the omnibus null hypothesis and conclude
that there are differences among the three groups.

d) Resids are stored in C3 and fits in C4, from the oneway command above.

MTB > plot c3 c4

[Character plot of the residuals (C3, vertical axis marked at -6.0, 0.0,
6.0) against the fitted values (C4, horizontal axis from 25.0 to 37.5)]

We could also plot C3 against C2, or produce aligned dotplots of the
residuals for each group.

Here's how to obtain residuals the long way (remember residuals are just
the differences between each observation and the group mean):

MTB > unstack c1 c3-c5;
SUBC> subscripts c2.
MTB > let c6=c3-mean(c3)
MTB > let c7=c4-mean(c4)
MTB > let c8=c5-mean(c5)
MTB > stack c6-c8 c9
MTB > plot c9 c2.

The plots suggest that the variability of the observations in the below
average group is greater than that for the other groups (the dotplots
and a quick look at the descriptive statistics support this). Since the
sample sizes are also a bit unequal, a very careful analysis would use a
one-way anova method that does not require the equal variance
assumption, such as the one in BMDP7D which we illustrated with the IBS
data.
--------------------------------------------------------------------
PROBLEM 4   Could You Get In?

a. Median and Quartiles of GPA

Using just the scatterplot from the problem, we can get a rough
graphical answer. Note that each tick mark represents an increment
of 0.16.

   Median = (10th + 11th)/2 = (2.24 + 2.56)/2 = 2.4
   Q1 = (5th + 6th)/2 = (1.92 + 2.08)/2 = 2.0
   Q3 = (15th + 16th)/2 = (3.04 + 3.04)/2 = 3.04

We should mention that alternative formulas for quartiles will produce
slightly different results; e.g., the formulas Q1=x[(n+1)/4] (integer
part) and Q3=x[3*(n+1)/4] (integer part) produce 1.92 for Q1.
If you use the actual data in the indicated file 95revp1.dat, then
MTB > describe c1
should give 2.400 for the median, 1.925 for Q1, and 3.075 for Q3.

b. Fit for Test = 5.0:
      GPA = -1.70 + 0.840*Test = -1.70 + 0.840*5.0 = 2.5
   Fit for Test = 6.0:
      GPA = -1.70 + 0.840*6.0 = 3.34
   The observed GPA at Test = 6.0 was 3.36.
   The residual equals the observed value minus the fitted value:
      3.36 - 3.34 = 0.02

c. The regression line passes through the point of sample means. So,
   plugging the sample mean of Test (5) into the fitted line,
      Sample Mean GPA = -1.70 + 0.840*5 = 2.5

d. In the two variable case, the correlation is the square root of the
   R-squared value expressed in decimal form, with the sign of the
   slope (positive here). So
      corr(GPA, Test) = sqrt(0.654) = 0.81
______________________________________________________________________
Problem 5

An alternative to transformations is to fit a polynomial.

MTB > name c7 'dayssq'
MTB > let c7 = c1*c1
MTB > regress 'size' 2 c1 c7 c20 c21

The regression equation is
size = 8.97 - 1.37 days + 0.0588 dayssq

Predictor        Coef       Stdev    t-ratio        p
Constant        8.972       4.119       2.18    0.057
days          -1.3750      0.3213      -4.28    0.002
dayssq       0.058771    0.005858      10.03    0.000

s = 1.278     R-sq = 99.5%     R-sq(adj) = 99.4%

Analysis of Variance
SOURCE        DF        SS        MS         F        p
Regression     2    2846.7    1423.3    871.59    0.000
Error          9      14.7       1.6
Total         11    2861.4

SOURCE        DF    SEQ SS
days           1    2682.3
dayssq         1     164.4

Unusual Observations
Obs.   days     size      Fit  Stdev.Fit  Residual  St.Resid
 10    35.0   30.300   32.842      0.518    -2.542    -2.18R

R denotes an obs. with a large st. resid.

Notice the t statistic for the coefficient of the quadratic term in this
model. It is highly significant, indicating a substantial quadratic
component to these data. (Also notice the success of this model in
general.) Examine the plot of residuals as a function of fitted values
to see if there is any trend or pattern in the data unaccounted for by
the present model.
MTB > plot c20 c21

[Character plot of the residuals (C20, vertical axis marked at -1.5,
0.0, 1.5) against the fitted values (C21, horizontal axis from 0 to 50)]

Note that there is no simple relationship between fitted values and
residuals, suggesting that this model is adequate. Is there a
relationship between residuals adjacent in time?
---------------------------------------------------------------------------
6. Take the 2x2 table and put the counts in two cols

MTB > chisquare c1 c2

Expected counts are printed below observed counts

             VOTE   NOVOTE    Total
    1        1481      132     1613    Some HS
           1438.7    174.3

    2        1036      173     1209    No HS
           1078.3    130.7

Total        2517      305     2822

ChiSq =  1.25 + 10.28 +
         1.66 + 13.71 = 26.90
df = 1

The critical chi-sq(0.95,1) = 3.84. Thus the null hypothesis of no
association is rejected.

The phi coefficient is one measure of association, given as
   sqrt(chi-sq/n) = sqrt(26.90/2822) = 0.098.

If you really want to work, the phi coeff can alternatively be computed by
   phi = [n(1,1)*n(2,2) - n(2,1)*n(1,2)]/{[n(1+)*n(2+)*n(+1)*n(+2)]**.5}

What's the odds ratio for voting for this table?
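
The chi-square statistic, the phi coefficient, and the odds ratio asked about above can all be computed directly from the four cell counts. The following is a minimal Python sketch under stated assumptions: the function name two_by_two_summary is hypothetical, the chi-square uses the standard 2x2 shortcut n*(n11*n22 - n21*n12)^2 divided by the product of the four margins (no continuity correction), and the odds ratio is taken as the odds of voting in row 1 (Some HS) over the odds of voting in row 2 (No HS).

```python
import math

# Sketch: chi-square, phi, and odds ratio for a 2x2 table of counts.
# The function name is an illustrative assumption.

def two_by_two_summary(n11, n12, n21, n22):
    """Return (chi_sq, phi, odds_ratio) for the table [[n11, n12], [n21, n22]]."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    n = row1 + row2
    # Cross-product difference n(1,1)*n(2,2) - n(2,1)*n(1,2).
    cross = n11 * n22 - n21 * n12
    # Product of the four marginal totals.
    margins = row1 * row2 * col1 * col2
    # Pearson chi-square for a 2x2 table, no continuity correction.
    chi_sq = n * cross ** 2 / margins
    # Phi from the cross-product formula; equals sqrt(chi_sq / n) up to sign.
    phi = cross / math.sqrt(margins)
    # Odds of column 1 (VOTE) in row 1 divided by the same odds in row 2.
    odds_ratio = (n11 * n22) / (n12 * n21)
    return chi_sq, phi, odds_ratio


# Counts from the table: rows Some HS / No HS, columns VOTE / NOVOTE.
chi_sq, phi, odds_ratio = two_by_two_summary(1481, 132, 1036, 173)
print(f"chi-square = {chi_sq:.2f}")
print(f"phi        = {phi:.3f}")
print(f"odds ratio = {odds_ratio:.2f}")
```

With these counts the sketch should agree with ChiSq = 26.90 and phi = 0.098 from the output above, and it gives an odds ratio of roughly 1.87 for voting (Some HS relative to No HS).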