SOLUTIONS
                 Start-up Review Problems-- Ed257
                      D Rogosa
                     January 1999

 
1. (Reference NWK Sec 16.8 pp. 681-5)
 
The ANOVA table is as follows, with calculations below.
 
   SOURCE        SS      df      MS
   Between       80       4      20
   Within       400      40      10
   Total        480      44
 
SSW = SST - SSB = 480-80= 400

df(within) = total n - (# of groups) = (44 + 1) - (4 + 1) = 40
         or df(total) - df(between)
MS = SS/df so that MSB = 80/4 = 20 and MSW = 400/40 = 10.
 
The omnibus null hypothesis is Ho: mu(1) = mu(2) = ... = mu(5),
i.e., that all 5 population means are equal, versus the alternative hypothesis
that not all are equal. The test statistic is the ratio of the mean squares,
MSB/MSW = 20/10 = 2.
The critical value for Type I error rate .10 is F(0.90, 4, 40) = 2.09
(Table A.4, rough interpolation).

Or use Minitab:
MTB >invcdf .90;
SUBC>f 4 40.
0.9000   2.0909

 Since 2<2.09, we do not reject the null hypothesis.
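
The arithmetic above can be checked with a short script. This is only a
sketch in Python (not part of the Minitab session); the critical value is
copied from the invcdf output above rather than computed.

```python
# Problem 1: fill in the one-way ANOVA table from SST, SSB and the df.
sst, ssb = 480.0, 80.0
df_total, df_between = 44, 4

ssw = sst - ssb                      # within-group sum of squares
df_within = df_total - df_between    # df(total) - df(between)
msb = ssb / df_between               # mean square between
msw = ssw / df_within                # mean square within
f_stat = msb / msw                   # omnibus F statistic

# Critical value F(0.90; 4, 40), copied from the Minitab invcdf output
# (an input here, not something this script computes).
f_crit = 2.0909
reject = f_stat > f_crit

print(f"SSW={ssw:.0f}  df_within={df_within}  MSB={msb:.0f}  MSW={msw:.0f}  F={f_stat:.2f}")
print("reject Ho" if reject else "do not reject Ho")
```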

NOTE: since subscripts cannot be displayed in this text mode we will usually
employ parens to indicate subscripts etc -- e.g. mu(1).
-----------------------------------------------------------------------------

2. 
a) The model for this problem is as follows:
   Y(ij) = mu + alpha(i) + epsilon(ij)
where
    i = 1,2,3  (3 groups)
    j = 1,2, ... n(i)  where n(1)=12, n(2)=14, n(3)=11
   Y(ij) = jth employee's response in the ith group
   mu = overall mean
 alpha(i) = effect of the ith group
epsilon(ij) = random error (individual differences) associated with
              the jth employee in the ith group
(Refer to NWK formula (16.62) on p. 693.)

(An alternative model in terms of the cell means rather
than main effects could be written:
Y(ij) = mu(i) + epsilon(ij)
where
   i = 1,2,3
   j = 1,2,...,n(i)
   Y(ij) = jth employee's response in the ith group
   mu(i) = mean of the ith group
   epsilon(ij) = random error associated with the jth employee in the
    ith group  
)
 
b) We are given
         G(1)    G(2)     G(3)
 n(i)     12      14       11
 y(i)bar  25.2   32.6      28.1  (sample means)
 s(i)^2    3.6    4.8       5.3  (sample variances)
 
  n = 12+14+11 = 37
First calculate the grand mean: weight each group mean by its sample size,
add, and divide by total n:
ybar = [25.2(12) + 32.6(14) + 28.1(11)]/(12+14+11) = 1067.9/37 = 28.862
Degrees of freedom between is 3 - 1 = 2, and within is 37 - 3 = 34.
 
Form SSB by deviating the group means from the grand mean (28.862),
squaring the deviations, multiplying by the group size, and summing
over the three groups:
SSB = 12(25.2-28.862)^2 + 14(32.6-28.862)^2 + 11(28.1-28.862)^2
    = 160.94 + 195.60 + 6.39
    = 362.93
MSB is 362.93/2 = 181.46.
Now, SSW = (n(1)-1)s(1)^2 + (n(2)-1)s(2)^2 + (n(3)-1)s(3)^2
         = 11(3.6) + 13(4.8) + 10(5.3)
         = 155
MSW is the weighted average of the within-group variances, with
weights n(i)-1; it is found by dividing SSW by the within df:
155/34 = 4.559.

Hence the ANOVA table is
SOURCE        SS     df    MS
Between      362.93   2    181.46
Within       155.00  34      4.559
Total        517.93  36
 
Test statistic = MSB/MSW = 181.46/4.559 = 39.80
 The 99th percentile point of F(2,34) is 5.30
(by simple interpolation since F(0.99,2,30)=5.39 and F(0.99,2,40)=5.18).
 
Since 39.80 > 5.30 we reject the null hypothesis of equal means in all
groups.
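
The calculation in part b) uses only the group sizes, means, and
variances, so it can be reproduced in a few lines. The following is a
sketch in Python (variable names are ours, not from the problem):

```python
# Problem 2(b): one-way ANOVA from group sizes, means and variances.
n = [12, 14, 11]
means = [25.2, 32.6, 28.1]
variances = [3.6, 4.8, 5.3]

n_total = sum(n)
# Grand mean: group means weighted by group size.
grand_mean = sum(ni * m for ni, m in zip(n, means)) / n_total

# Between-groups SS: size-weighted squared deviations of the group means.
ssb = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))
# Within-groups SS: each sample variance times its degrees of freedom.
ssw = sum((ni - 1) * v for ni, v in zip(n, variances))

df_between = len(n) - 1
df_within = n_total - len(n)
msb = ssb / df_between
msw = ssw / df_within
f_stat = msb / msw

print(f"grand mean = {grand_mean:.3f}")
print(f"Between  SS={ssb:.2f}  df={df_between}  MS={msb:.2f}")
print(f"Within   SS={ssw:.2f}  df={df_within}  MS={msw:.3f}")
print(f"F = {f_stat:.2f}")
```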
---------------------------------------------------------------------------- 

3. 
 

a) MTB > read '/usr/class/ed257/HW/knee.dat' c1 c2
      24 ROWS READ

  ROW    C1   C2

    1    29    1
    2    42    1
    3    38    1
    4    40    1
   .  .  .

 MTB > describe c1;
 SUBC> by c2.

                C2        N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
 C1              1        8    38.00    40.00    38.00     5.48     1.94
                 2       10    32.00    31.00    31.62     3.46     1.10
                 3        6    24.00    22.50    24.00     4.43     1.81

                C2      MIN      MAX       Q1       Q3
 C1              1    29.00    43.00    32.00    42.00
                 2    28.00    39.00    29.00    35.00
                 3    20.00    32.00    20.75    27.50

The group means are 38, 32, and 24 for the below average, average, and
above average groups, respectively.  Variances (the squared standard
deviations) are approximately 30.0, 12.0, and 19.6.

(Note:  The group means and SDs are also displayed under the ANOVA table)

b)

 MTB > dotplot c1;
 SUBC> by c2.

  C2
  1 (below average)
                                  . .               .   :   : .
           -----+---------+---------+---------+---------+---------+-C1
  C2
  2 (average)
                                . : . :   .   :       .
           -----+---------+---------+---------+---------+---------+-C1
  C2
  3 (above average)
                . . . .     .           .
           -----+---------+---------+---------+---------+---------+-C1
             20.0      25.0      30.0      35.0      40.0      45.0

These plots illustrate the clustering of the observations in each
group about the group means.  The small sample sizes make it difficult
to detect outliers or heteroskedasticity (unequal group variances),
although the observations in the below average group appear to be
somewhat more spread out than are those in the other groups.

c)

 MTB > oneway c1 c2 resids in c3 fits in c4;
 SUBC> tukey.

(Note: the above command tells Minitab to store the residuals in C3
and the fitted values (which are just the group means) in C4.  The
words "resids in" and "fits in" are optional; one could just write
 MTB > oneway c1 c2 c3 c4)

 ANALYSIS OF VARIANCE ON C1
 SOURCE     DF        SS        MS        F        p
 C2          2     672.0     336.0    16.96    0.000
 ERROR      21     416.0      19.8
 TOTAL      23    1088.0
                                    INDIVIDUAL 95 PCT CI'S FOR MEAN
                                    BASED ON POOLED STDEV
  LEVEL      N      MEAN     STDEV  -------+---------+---------+---------
      1      8    38.000     5.477                           (----*-----)
      2     10    32.000     3.464                 (----*----)
      3      6    24.000     4.427   (-----*-----)
                                    -------+---------+---------+---------
 POOLED STDEV =    4.451                24.0      30.0      36.0


The omnibus null hypothesis is
Ho: mu(1)=mu(2)=mu(3)

We test this against the alternative
Ha:  not all mu(i) are equal

Test statistic is MSB/MSW = 336/19.8 = 16.96.

Find critical value F(.95,2,21):

 MTB > invcdf .95;
 SUBC> f 2 21.
     0.9500    3.4668

Since 16.96 > 3.4668, we reject the omnibus null hypothesis and 
conclude that there are differences among the three groups.
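
As a check, the ANOVA table in part c) can be rebuilt from the group
sizes, means, and standard deviations that Minitab prints under it. This
is a sketch in Python; because the SDs are rounded to three decimals, the
error SS is recovered only approximately, and the critical value is
copied from the invcdf output above.

```python
# Problem 3(c): rebuild the ANOVA table from the printed group summaries.
n = [8, 10, 6]
means = [38.0, 32.0, 24.0]
sds = [5.477, 3.464, 4.427]   # rounded values from the Minitab output

n_total = sum(n)
grand_mean = sum(ni * m for ni, m in zip(n, means)) / n_total

ssb = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))
sse = sum((ni - 1) * s ** 2 for ni, s in zip(n, sds))

msb = ssb / (len(n) - 1)
mse = sse / (n_total - len(n))
f_stat = msb / mse
pooled_sd = mse ** 0.5

f_crit = 3.4668               # F(.95; 2, 21) from Minitab, not computed here
print(f"SSB={ssb:.1f}  SSE={sse:.1f}  F={f_stat:.2f}  pooled SD={pooled_sd:.3f}")
print("reject Ho" if f_stat > f_crit else "do not reject Ho")
```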

d)  Residuals are stored in C3 and fits in C4 by the oneway command above.

 MTB > plot c3 c4

          - *
          -                                 *
       6.0+
          -                                                         *
  C3      -                                                         2
          - *                               2                       2
          -                                 *
       0.0+                                                         *
          - *                               2
          - *                               *
          - 2                               3
          -
      -6.0+
          -
          -                                                         *
          -                                                         *
          -
            ----+---------+---------+---------+---------+---------+--C4
             25.0      27.5      30.0      32.5      35.0      37.5

We could also plot C3 against C2, or produce aligned dotplots of the
residuals for each group.

Here's how to obtain residuals the long way (remember residuals are
just the differences between each observation and the group mean):
 MTB > unstack c1 c3-c5;
 SUBC> subscripts c2.
 MTB > let c6=c3-mean(c3)
 MTB > let c7=c4-mean(c4)
 MTB > let c8=c5-mean(c5)
 MTB > stack c6-c8 c9
 MTB > plot c9 c2.

The plots suggest that the variability of the observations in the
below average group is greater than that in the other groups (the
dotplots and a quick look at the descriptive statistics support this).
Since the sample sizes are also somewhat unequal, a very careful analysis
would use a one-way ANOVA method that does not require the equal
variance assumption, such as the one in BMDP7D that we illustrated
with the IBS data.
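
The "long way" of obtaining residuals in part d) (each observation minus
its own group mean) can be sketched in Python as follows. The data values
below are made up for illustration only; they are NOT the knee.dat values.

```python
from collections import defaultdict

# Hypothetical toy data (not knee.dat): responses and group labels.
y = [10, 12, 14, 20, 22, 30, 34]
g = [1, 1, 1, 2, 2, 3, 3]

# Collect the observations by group (the analogue of Minitab's unstack).
by_group = defaultdict(list)
for value, label in zip(y, g):
    by_group[label].append(value)

# Fitted value for every observation is the mean of its group.
group_mean = {label: sum(vals) / len(vals) for label, vals in by_group.items()}
fits = [group_mean[label] for label in g]

# Residual = observation - group mean, kept in the original row order.
residuals = [value - fit for value, fit in zip(y, fits)]

for label, value, fit, res in zip(g, y, fits, residuals):
    print(label, value, fit, res)
```

Within each group the residuals sum to zero, which is a quick way to
confirm that the right mean was subtracted.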

--------------------------------------------------------------------
4. Could You Get In?

a. Median and Quartiles of GPA
Using just the scatterplot from the problem, we can get a
rough graphical answer:
   Note that each tick mark represents an increment of 0.16.

   Median = (10th + 11th)/2 = (2.24 +2.56)/2 = 2.4

   Q1= (5th + 6th)/2 = (1.92 + 2.08)/2 = 2.0

   Q3= (15th + 16th)/2 = (3.04 + 3.04)/2 = 3.04
We should mention that alternative formulas for quartiles will
produce slightly different results; e.g., the formulas
Q1 = x[(n+1)/4] (integer part) and Q3 = x[3*(n+1)/4] (integer part)
produce 1.92 for Q1 (the 5th ordered value, since n = 20).

If you use the actual data in the indicated file 95revp1.dat,
     MTB > describe c1
should give 2.400 for the median, 1.925 for Q1, and 3.075 for Q3.
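
The two hand rules for quartiles mentioned in part a) can be compared
side by side. This is a sketch in Python; the function names are ours,
the "halves" rule is written for even n only, and the demonstration data
(the integers 1 to 20) are a stand-in, not the GPA values.

```python
from statistics import median

def quartiles_by_halves(sorted_x):
    """Q1 and Q3 as the medians of the lower and upper halves (even n)."""
    half = len(sorted_x) // 2
    return median(sorted_x[:half]), median(sorted_x[-half:])

def quartiles_by_position(sorted_x):
    """Q1 = x[(n+1)/4] and Q3 = x[3(n+1)/4], integer part, 1-based."""
    n = len(sorted_x)
    q1_pos = (n + 1) // 4
    q3_pos = 3 * (n + 1) // 4
    return sorted_x[q1_pos - 1], sorted_x[q3_pos - 1]

# Stand-in data with n = 20 (NOT the GPA values from the problem).
x = list(range(1, 21))
halves = quartiles_by_halves(x)      # averages the 5th/6th and 15th/16th values
position = quartiles_by_position(x)  # picks the 5th and 15th values
print(halves, position)
```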
 

b. fit for 5.0
   GPA = -1.70+0.840*Test
       = -1.70+0.840*5.0
       = 2.5

   fit for 6.0
   GPA = -1.70+0.840*6.0=3.34
   
   observed at 6.0 was 3.36
 
   residual equals observed value minus fitted value:
     3.36-3.34=0.02
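
The fitted values and the residual in part b) follow directly from the
fitted line; a minimal Python sketch (the function name is ours):

```python
# Problem 4(b): fitted values and a residual from GPA = -1.70 + 0.840*Test.
intercept, slope = -1.70, 0.840

def fitted_gpa(test_score):
    """Fitted GPA at a given Test score."""
    return intercept + slope * test_score

fit_5 = fitted_gpa(5.0)
fit_6 = fitted_gpa(6.0)

observed_6 = 3.36
residual_6 = observed_6 - fit_6      # observed minus fitted

print(f"fit(5.0) = {fit_5:.2f}")
print(f"fit(6.0) = {fit_6:.2f}")
print(f"residual at 6.0 = {residual_6:.2f}")
```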

  
c. The regression line passes through the point of sample means
   (mean Test, mean GPA). So, with sample mean Test = 5,

   Sample Mean GPA = -1.70 + 0.840*5
                   = 2.5
   
   
d. In the two-variable case, the correlation is the square root of the
   R-squared value (expressed in decimal form), taken with the sign of
   the slope. The slope is positive here.

   So corr(GPA, Test) = sqrt(0.654) = 0.81
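
A small Python sketch of the step in part d): take the square root of
R-squared and attach the sign of the fitted slope (0.840 from part b).

```python
import math

# Problem 4(d): correlation from R-squared in simple linear regression.
r_squared = 0.654
slope = 0.840

# The magnitude is sqrt(R-squared); the sign is that of the slope.
r = math.copysign(math.sqrt(r_squared), slope)
print(f"corr(GPA, Test) = {r:.2f}")
```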

______________________________________________________________________


5.

An alternative to  transformations is to fit a polynomial.  

 MTB > name c7 'dayssq' 
 MTB > let c7 = c1*c1
 MTB > regress 'size' 2 c1 c7 c20 c21
 
 The regression equation is
 size = 8.97 - 1.37 days + 0.0588 dayssq
 
 Predictor       Coef       Stdev    t-ratio        p
 Constant       8.972       4.119       2.18    0.057
 days         -1.3750      0.3213      -4.28    0.002
 dayssq      0.058771    0.005858      10.03    0.000
 
 s = 1.278       R-sq = 99.5%     R-sq(adj) = 99.4%
 
 Analysis of Variance
 
 SOURCE       DF          SS          MS         F        p
 Regression    2      2846.7      1423.3    871.59    0.000
 Error         9        14.7         1.6
 Total        11      2861.4
 
 SOURCE       DF      SEQ SS
 days          1      2682.3
 dayssq        1       164.4
 
 Unusual Observations
 Obs.    days      size       Fit Stdev.Fit  Residual   St.Resid
  10     35.0    30.300    32.842     0.518    -2.542     -2.18R 
 
 R denotes an obs. with a large st. resid.
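
Several of the summary figures in the output above can be cross-checked
from the printed coefficients and sums of squares. This is a sketch in
Python; since the inputs are the rounded values Minitab displays, the
results agree with the output only up to rounding.

```python
import math

# Problem 5: t-ratios, s, R-sq and R-sq(adj) from the printed output.
coef = {"Constant": 8.972, "days": -1.3750, "dayssq": 0.058771}
stdev = {"Constant": 4.119, "days": 0.3213, "dayssq": 0.005858}

# t-ratio = coefficient / its standard error.
t_ratio = {name: coef[name] / stdev[name] for name in coef}

ss_reg, ss_err, ss_tot = 2846.7, 14.7, 2861.4
df_err, df_tot = 9, 11

s = math.sqrt(ss_err / df_err)                           # residual SD
r_sq = ss_reg / ss_tot
r_sq_adj = 1 - (ss_err / df_err) / (ss_tot / df_tot)

for name, t in t_ratio.items():
    print(f"{name:8s} t = {t:6.2f}")
print(f"s = {s:.3f}  R-sq = {100 * r_sq:.1f}%  R-sq(adj) = {100 * r_sq_adj:.1f}%")
```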
 
Notice the t statistic for the coefficient of the quadratic term
in this model.  It is highly significant (t = 10.03), indicating a
substantial quadratic component to these data.  (Also notice the success
of this model in general: R-sq = 99.5%.)

Examine the plot of residuals as a function of fitted values to
see if there is any trend or pattern in the data unaccounted for by the
present model.


 MTB > plot c20 c21 
 
          -
          -                                        *
       1.5+
          -              *
  C20     -
          -     *
          -                      *
       0.0+  **              *                                 *
          -                             *
          -       *
          -         *
          -
      -1.5+
          -
          -                                  *
          -
          -
            +---------+---------+---------+---------+---------+------C21     
            0        10        20        30        40        50
 
Note that there is no simple relationship between fitted values
and residuals, suggesting that this model is adequate.
Is there a relationship between residuals adjacent in time?

---------------------------------------------------------------------------
6.
Take the 2x2 table and put the counts in two columns (C1 = VOTE,
C2 = NOVOTE):

 MTB > chisquare c1 c2
 
 Expected counts are printed below observed counts
           VOTE     NOVOTE

     1     1481      132     1613          Some HS
         1438.7    174.3
 
     2     1036      173     1209           No HS
         1078.3    130.7
 
 Total     2517      305     2822
 
 ChiSq =   1.25 +  10.28 +
           1.66 +  13.71 = 26.90
 df = 1
 
 The critical chi-sq(0.95,1) = 3.84. Since 26.90 > 3.84, the null
 hypothesis of no association is rejected.
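
The expected counts and the chi-square statistic above can be reproduced
from the observed table. This is a sketch in Python; the critical value
3.84 is taken from the text rather than computed.

```python
# Problem 6: Pearson chi-square for the 2x2 table (rows: groups 1 and 2;
# columns: VOTE, NOVOTE).
observed = [[1481, 132],
            [1036, 173]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under no association = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

crit = 3.84   # chi-sq(0.95, 1), from the text
print([[round(e, 1) for e in row] for row in expected])
print(f"ChiSq = {chi_sq:.2f}, reject = {chi_sq > crit}")
```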

 The phi coefficient is one measure of association, given as
  sqrt(chi-sq/n) = sqrt(26.90/2822) = 0.098.
If you want to do more work, the phi coefficient can alternatively be
computed directly from the cell counts by
phi = [n(1,1)*n(2,2) - n(2,1)*n(1,2)]/{[n(1+)*n(2+)*n(+1)*n(+2)]**.5}
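
Both routes to phi can be compared numerically. This is a sketch in
Python using the cell counts from the table above; the variable names
follow the n(i,j) notation of the formula.

```python
import math

# Problem 6: phi coefficient two ways.
n11, n12 = 1481, 132    # row 1: VOTE, NOVOTE
n21, n22 = 1036, 173    # row 2: VOTE, NOVOTE

r1, r2 = n11 + n12, n21 + n22      # row totals n(1+), n(2+)
c1, c2 = n11 + n21, n12 + n22      # column totals n(+1), n(+2)
n = r1 + r2

# Directly from the cell counts.
phi_direct = (n11 * n22 - n21 * n12) / math.sqrt(r1 * r2 * c1 * c2)

# From the chi-square statistic reported in the Minitab output.
phi_from_chisq = math.sqrt(26.90 / n)

print(f"phi (direct) = {phi_direct:.3f}")
print(f"phi (from chi-sq) = {phi_from_chisq:.3f}")
```

The direct formula also carries a sign, which sqrt(chi-sq/n) does not.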

What's the odds ratio for voting for this table?