COURSE EXAMPLES AND FILES ED257 2003

COURSE EXAMPLES AND FILES --- ED257 D Rogosa

*.dat are ASCII data files. Output from computer packages (e.g. MINITAB) is typically in *.lis, *.out, or *.log files. Links in this file take you directly to the specific data or data analysis example.
This file is cumulative; I'll add entries as we move to that material.

(Note: Additional examples not in electronic form will be introduced throughout the course during lectures. Also, additional data sets and output in electronic form for homework assignments, solutions, etcetera will be described in those documents and posted to the course directory at the appropriate point in the course.)

I. Design and Analysis of Comparative Studies (Experiments)

NAME DESCRIPTION

alphatot.tab Tabulation of total error rate
probabilities for c inferences each done at level
alph: tot = 1 - (1 - alph)^c is solved for
alph. Mathematica script appended.
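
The relation tabulated in alphatot.tab inverts in closed form. The sketch below is in Python, not the Mathematica script appended to the file, and the function names are illustrative assumptions. The formula itself is exact when the c inferences are independent.

```python
def per_inference_alpha(tot, c):
    """Solve tot = 1 - (1 - alph)**c for alph.

    tot: desired total error rate over all c inferences (0 < tot < 1)
    c:   number of inferences, each done at level alph
    """
    if not 0.0 < tot < 1.0:
        raise ValueError("tot must lie strictly between 0 and 1")
    if c < 1:
        raise ValueError("c must be at least 1")
    return 1.0 - (1.0 - tot) ** (1.0 / c)


def total_error_rate(alph, c):
    """Forward direction: total error rate for c inferences at level alph."""
    return 1.0 - (1.0 - alph) ** c


# Per-inference level needed to hold the total error rate at .05
# for 10 inferences (hypothetical choice of tot and c).
alph_10 = per_inference_alpha(0.05, 10)
```

The resulting alph is slightly larger than the Bonferroni value tot/c.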

counsel.dat A 3 x 10 mixed model
design with 2 replications per cell. Fixed factor
in c1 has 3 levels, random factor in c2 has 10
levels. The fixed factor is 3 different
methods/strategies of counseling, and the
random factor represents 10 counselors sampled
from a population of counselors. Six clients
from each counselor are divided amongst the 3
counseling strategies. The outcome measure in
c3 is a self-report of neurotic symptoms. The
mixed model analysis (using MINITAB) is
described in lecture.

harr.dat Data obtained from the
Hopkins&Glass textbook. Their description is
"Harrington (1968) experimented with the order
of 'mental organizers' that structure the
material for the learner. A group of 30 persons
were randomly split into three groups of 10 each.
Group I received organizing material before
studying instructional material on mathematics;
Group II received the 'organizer' after studying
the mathematics; Group III received the math
materials but no organizing materials. Scores
are from a 10-item mathematics test on the
instructional content."
The data are in "unstacked" form in c1-c3.
harr.lis One-way anova (MINITAB) on harr.dat.
harr1v.out BMDP1V output for harr.dat using
orthogonal contrasts.
hartukey.lis Minitab implementation of Tukey post-hoc
comparison procedures with the harr.dat data.
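
For readers unsure of the "unstacked" versus "stacked" layouts, here is a minimal Python sketch (not the MINITAB session in harr.lis). It stacks three columns and computes the one-way anova F ratio. The function names and the scores are invented for illustration; the scores are not the harr.dat values.

```python
def stack(columns):
    """Turn unstacked columns (one list per group) into stacked form.

    Returns two parallel lists: the responses and the group index (1, 2, ...).
    """
    values, groups = [], []
    for g, col in enumerate(columns, start=1):
        values.extend(col)
        groups.extend([g] * len(col))
    return values, groups


def one_way_anova(columns):
    """Return (F, df_between, df_within) for a one-way fixed effects anova."""
    n_total = sum(len(col) for col in columns)
    grand_mean = sum(sum(col) for col in columns) / n_total
    means = [sum(col) / len(col) for col in columns]
    ss_between = sum(len(col) * (m - grand_mean) ** 2
                     for col, m in zip(columns, means))
    ss_within = sum((y - m) ** 2
                    for col, m in zip(columns, means) for y in col)
    df_between = len(columns) - 1
    df_within = n_total - len(columns)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented scores for illustration only; these are NOT the harr.dat values.
c1, c2, c3 = [6, 8, 10], [4, 5, 6], [1, 2, 3]
values, groups = stack([c1, c2, c3])
f_ratio, df1, df2 = one_way_anova([c1, c2, c3])
```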

integ.dat 2 x 2 fixed effects with 50
replications per cell. Data obtained from early
Minitab Handbook which gives the following
description: "A researcher at Columbia
University was interested in the effect of
school integration on racial attitudes. He
gave an "ethnocentrism" test to four groups of
children: black children in a segregated school,
white children in a segregated school, black
children in an integrated school, and white
children in an integrated school. 'Ethnocentrism'
is defined as the tendency of children to prefer
to associate with, and respect, other children
of the same ethnic group to those of another
ethnic group. Thus, students who score high on
this test have a stronger preference for their
own race." The data are in stacked form,
with the test score in c1, schooltype in c2
(1 = integrated, 2 = segregated) and race in c3
(1 = black, 2 = caucasian).
integ.lis Cell means and anova table
(from MINITAB) for integ.dat.

rand2way.dat The data are from a 3 x 3
design with 2 replications per cell. A classic
measurement study design, as in generalizability
theory G-studies. The actual data are from
Ott's text, with the following (appetizing)
description: "Consider an experiment to examine
the effects of different analysts and subjects
in chemical analyses for the DNA content of
plaque. Three female subjects (ages 18-20
years) were chosen for the study. Each subject
was allowed to maintain her usual diet,
supplemented with 30 mg of sucrose per day. No
toothbrushing or mouthwashing was allowed during
the study. At the end of the week, plaque was
scraped from the entire dentition of each subject
and divided into six samples.
Each of the analysts chosen at random made a DNA
concentration determination on two samples for
each subject. Data are in units of 10
micrograms." The DNA concentrations are in c1,
analysts in c2 (1,2,3), subjects in c3 (1,2,3).
rand2way.lis Analysis of rand2way.dat using
MINITAB. Table of cell means, random effects
anova including variance components estimation.
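
Variance components in the two-way random effects model follow from equating mean squares to their expectations. The Python sketch below is an illustration, not the MINITAB output in rand2way.lis. It assumes a balanced design with a levels of A, b levels of B and n replications per cell; the function name and the mean squares in the example are invented.

```python
def variance_components(ms_a, ms_b, ms_ab, ms_e, a, b, n):
    """Method-of-moments estimates for a balanced two-way random model.

    Expected mean squares used:
      E(MSE)  = s2
      E(MSAB) = s2 + n*s2_AB
      E(MSA)  = s2 + n*s2_AB + n*b*s2_A
      E(MSB)  = s2 + n*s2_AB + n*a*s2_B
    Estimates are NOT truncated at zero, so a negative value can occur.
    """
    return {
        "A": (ms_a - ms_ab) / (n * b),
        "B": (ms_b - ms_ab) / (n * a),
        "AB": (ms_ab - ms_e) / n,
        "error": ms_e,
    }


# Invented mean squares for a 3 x 3 design with 2 replications per cell;
# these are NOT the rand2way.dat results.
est = variance_components(ms_a=20.0, ms_b=14.0, ms_ab=8.0, ms_e=2.0,
                          a=3, b=3, n=2)
```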

scitest.dat Data collected as part of study designed
to investigate the feasibility and technical
quality of science performance assessments. Two
tasks, called Radiation and Rate of Cooling,
were developed from a common "task shell";
in other words, they were designed to be as
parallel as possible in the science processes
tested and in the format of stimulus materials
and required response. They can be thought of
as two sample tasks from a "universe" of similar,
parallel tasks. The investigators treat task as
a random factor because they could imagine
creating additional tasks out of the task shell
from which these two came. This data set
contains the scores of thirty students, assumed
to be drawn at random from the population of
students, each tested on both tasks. Three
raters scored the responses; each paper was
scored by two of the three raters. The students
come from three different schools, ten from each.
Scores are in C1, student ID is in C2, task (1
for Radiation, 2 for Rate of Cooling) is in C3,
rater is in C4, and school is in C5.
scitest.lis Minitab output from a 2-way random effects anova
with outcome the score on the science test, with
the two random factors being student and task.
So the design is 30x2 with 2 replications per
cell.

smsg.dat (Used in Part I review and analysis of covariance.)
Data from a mathematics curriculum evaluation,
circa 1961. Purpose of the large scale study was
to compare mathematics achievement in a
traditional ninth-grade algebra course with
that in an alternative course developed by the
School Mathematics Study Group (SMSG). 43
teachers from schools across the US
participated; by random assignment there were
21 SMSG (new math) classrooms with 22 traditional
math classrooms.
Columns c1 and c3 contain group indicator
variables; c3 = 1 is SMSG classroom and c3 = 0
is traditional.
The post-instruction outcome measure (classroom
average) on math achievement given at the end of
the school year is in c2; this test was a
traditional algebra test published by the
Cooperative Test division of Educational Testing
Service.
In c4 is a pre-instruction ("pre-test") measure
of knowledge of number systems.
smsg.lis Used in Part I review. Descriptive and
inferential two-group comparisons for the outcome
measure (c2) in smsg.dat.

sunburn.dat Two-way mixed example; taken from Sunscreen ex.
Ott p.770
A corporation is interested in comparing two
different sunscreens (s1 and s2). A random
sample of 10 females (ages 20-25 years)
participated in the study. For each person two
1" x 1" squares were marked off on either side
of the back, under the shoulder but above the
small of the back. Sunscreen s1 was randomly
assigned to the two squares on one side of the
back, with s2 on the other two squares. Exposure
to the sun was for a two-hour period.
The outcome was change (postexposure minus
preexposure) in a reading based on the color of
skin in a square. So we have 10 levels of the
random column factor subjects, two levels of the
fixed row factor, sunscreen, and two replications
per cell. In file sunburn.dat we have the
outcome measure in c1, the type of sunscreen
(s1 = 1, s2 = 2) in c2, and the person (i.e. female
tanning subject) in c3.
sunburn.lis Minitab output for the
mixed model analysis of the sunburn.dat data,
a 2X10 design with 2 replications per cell.

unbalanc.dat
Data for a 2 x 3 fixed effects
design, having between 1 and 3 replications per
cell. The data are shown and described in Table
20.1 and section 20.2 of our NWK text. The first
part of this data file has the outcome measure
(growth rate in response to therapy) in c4, the
row factor (subject gender 1,2) in c1, the column
factor (degree of depressed development;
severe = 1, moderate = 2, mild = 3) in c2, and
the replication indicator in c3.
This data structure is set up for the GLM
approach to the analysis of unbalanced designs.
The second part of the data file is set up for
the application of the approximate analysis based
on cell means; cell means in c1, row factor in
c2, column factor in c3.
unbalanc.log
Analyses of the data in unbalanc.dat.
First is shown the GLM analysis (cf. MTB version
7 manual p. 8-27). Second the approximate cell
means analysis is constructed and then compared
with GLM results.

stress.dat Data are from a 2x2x2 fixed
effects design with 3 replications per cell.
Data are shown in Table 22.2 and described in
Section 22.2 of our NWK text. The outcome
measure is exercise tolerance from a stress
test in c1, with gender (male = 1, female = 2)
in c2, body fat level (low = 1, high = 2) in c3
and smoking history (light = 1, heavy = 2) in c4.
stress.lis Analysis of the 3-way design from
stress.dat. Description using versions of
MINITAB Table command along with Layout
subcommand (cf. MTB version 7 manual pages
11-9,11-12). Three-way analysis of variance
using anova command.

***Randomized Blocks***

bhhtab71.dat Data from a 5 x 4 randomized
block design with 5 levels of the blocking
variable and 4 levels of the (fixed) treatment
variable. One replication per cell.
The data are from the Box, Hunter and Hunter
text with the following description:
"In this example a process for the manufacture of
penicillin was being investigated, and the yield
was the response of primary interest. There were
4 variants of the basic process to be studied.
It was known that an important raw material, corn
steep liquor, was quite variable. Fortunately
blends sufficient for four runs could be made,
thus supplying the opportunity to run all 4
treatments with each of the 5 blocks (blends of
corn steep liquor). The experiment was protected
from extraneous unknown sources of bias by
running the treatments in random order within
each block." The yield is in c1, block indicator
in c2, and treatment indicator in c3.
bhhtab71.lis Description and analysis of variance
on randomized block design data in bhhtab71.dat.

dental.dat Randomized block example, factorial treatment
structure. From NWK prob DENTAL PAIN.
The "learning statistics is like pulling teeth"
analogy is irresistible.
An anesthesiologist made a comparative study of
the effects of acupuncture and codeine on
postoperative dental pain in male subjects. The
four treatments were (1) placebo treatment--a
sugar capsule and two inactive acupuncture
points; (2) codeine treatment only--a codeine
capsule and two inactive acupuncture points; (3)
acupuncture only--a sugar capsule and two active
acupuncture points; (4) both codeine and
acupuncture. These 4 conditions have a 2x2
factorial structure.
Thirty-two subjects were grouped into 8 blocks
of four according to an initial evaluation of
their level of pain tolerance. The subjects in
each block were then randomly assigned to the 4
treatments. Pain relief scores were obtained 2
hours after dental treatment. Data were
collected on a double-blind basis. In file
dental.dat c1 is pain relief score (higher
means more pain relief), c2 is block, c3 is
codeine, c4 is acupuncture--for c3 and c4, 1=no.
dental.lis Minitab analysis for randomized block design of
dental.dat.

***Nested Designs***

training.dat NESTED DESIGN,
training school example, from NWK Chap 28.
Description p.970: A large manufacturing company
operates 3 regional training schools for
mechanics, one in each of its operating
districts. The schools have two instructors each
who teach classes of about 15 mechanics in 3-week
sessions.
The company was concerned about the effect of
School (factor A) and instructor (factor B) on
the learning achieved. To investigate these
effects, classes in each district were formed in
the usual way and then randomly assigned to one
of the two instructors in the school [making
class the "unit of analysis"]. This design was
implemented for two 3-week sessions, and at the
end of each session a suitable measure of
learning for the class was obtained.
Data are given in training.dat: C1 has class
learning score, C2 is School (1,2,3), C3 is
instructor (1,2), and C4 is class (first or
second 3-week period).
training.lis Data analysis using Minitab for training.dat,
including the nested design anova.

NWK 28.9 Cross-nested design ("three factor partially nested design")
Data for decision making example
Minitab analysis for decision making example, NWK Fig 28.7

schoolcn.dat Crossed and Nested factors--
teaching methods, schools, teachers,
students. Can you work it out?

This example is taken from a well-known
educational statistics textbook: Hopkins&Glass.
On the theme that you should rejoice that we use
NWK instead, I found six (and counting) major errors
in this text's exposition and solution for this single
example.

The example involves the comparison of 5 teaching
methods. Two Schools (considered to be sampled
at random) each employ these five teaching
methods--i.e. each of the 5 teaching methods
appears with each of the two schools--5x2
combinations. Within *each* of the two schools,
3 teachers are chosen at random, so we have three
teachers chosen in School 1 and three different
teachers chosen in School 2. Each teacher employs
each of the 5 teaching methods, and the outcome
data are mastery scores (mastery or not) for
three students for each teacher-method combination
within each school. NWK calls this a
partially nested or crossed-nested design:
section 28.9 pp. 1149-1154 (minitab p1153)

In file schoolcn.dat the columns are outcome; method;
school; teacher(within school); student replication.

Note that the outcome measure is 0/1; this text goes
on to assert "Balanced anova designs have been shown
to yield accurate results even with dichotomous
dependent variables [refs]..." For the present we
will take them at their word.

In file schoolcn.sol we answer the following questions

a. Table the means for method crossed with school, and
construct the corresponding profile plot. Do there
appear to be main effects or interactions?
b. Obtain means for each teacher(within school); do
there appear to be teacher effects?
c. Construct an appropriate anova model and obtain
the corresponding anova table for this design.
d. Carry out the series of statistical tests for the
terms (effects) identified in your model in part c;
state your results, being careful to control the
overall Type I error rate.

schoolcn.sol Analyses for
cross-nested schoolcn example

***Repeated Measures***

drugrep.dat Example from Winer Sec 4.3, Table 4.3-1
A study of the effects of 4 drugs upon reaction time
to a series of standardized tasks was undertaken
with 5 subjects all of whom had been well-trained
in these tasks.
The 5 subjects are a random sample from a
population of interest to the experimenter. Each
subject was observed under each of the drugs; the
order in which the drugs were administered was
randomized. Time separation between doses was
employed. The outcomes (C1 in drugrep.dat) were
mean reaction time on the series of standardized
tasks; in drugrep.dat C2 (1,2,3,4,5) is the
person and C3 (1,2,3,4) is the drug.
The drug data comprise a oneway repeated measures
classification with 4 levels representing the
reaction times associated with 4 types of drug.
drugrep.lis Minitab analysis for
repeated measures design for drugrep.dat.

bloodflow.dat bloodflow example NWK sec 29.3

Section 29.3, Two-Factor Experiments with Repeated Measures on Both Factors (p. 1181)
TABLE 29.7 Data for Blood Flow Example.

Subject            Treatment
          A1B1   A1B2   A2B1   A2B2
   1        2     10      9     25
   2       -1      8      6     21
   3        0     11      8     24
 ...
  10       -2     10     10     28
  11        2      8     10     25
  12       -1      8      6     23

A clinician studied the effects of two drugs used either alone or
together on the blood flow in human subjects. Twelve healthy
middle-aged males participated in the study and they are viewed
as a random sample from a relevant population of middle-aged
males. The four treatments used in the study are defined as
follows:

A1B1 placebo (neither drug)
A1B2 drug B alone
A2B1 drug A alone
A2B2 both drugs A and B

The 12 subjects received each of the four treatments in
independently randomized orders. The response variable is the
increase in blood flow from before to shortly after the
administration of the treatment. The treatments were administered
on successive days. This prevented any carryover effects because
the effect of each drug is short-lived. The experiment was
conducted in a double-blind fashion so that neither the physician
nor the subject knew which treatment was administered when the
change in blood flow was measured.

Table 29.7 and bloodflow.dat contain the data for this study.
A negative entry denotes a decrease in blood flow. Figure 29.5
and bloodflow.lis contain the MINITAB output for the fit of
repeated measures model (29.10). Included in the output are the
expected mean squares for the specified ANOVA model. As explained
in Chapter 28, each term in an expected mean square is
represented in the MINITAB output by (1) the numeric code, in
parentheses, for the variance of the model term, and (2) the
preceding number which is the numerical multiple. When the model
term is fixed, the letter Q is used in the printout.

bloodflow.lis Minitab analyses of bloodflow

Brogan Kutner Example Pre-post Repeated Measures
Brogan Kutner Analyses Minitab and SAS repeated measures analyses

shoes.dat "It's gotta be the shoes"
Athletic Shoe sales example from NWK Chap 29.4
Between subjects factor, repeated measures on
one-factor: A national retail chain
wanted to study the effects of two advertising
campaigns (factor A) on the sales of athletic
shoes over time (factor B). Ten similar test
markets (subjects S) were randomly chosen to
participate in the study (each campaign used in
5 of these markets). Sales data (c1 in shoes.dat)
were collected for 3 two-week periods (two weeks
prior to campaign, two-weeks during, two weeks
after; coded 1,2,3 in c3 in shoes.dat). In
shoes.dat c2 indicates the ad campaign (1,2)
and c4 indicates test market site (1,2,3,4,5).
shoes.lis The Minitab analysis replicates NWK:
'sales' is the outcome measure, 'ad' is type of
advertising campaign; 'time' is the repeated
measures factor; and 'subj' is test market site.

***** PART II ANALYSIS OF ASSOCIATIONS: REGRESSION and CORRELATION *******

corr.dat 28 bivariate observations,
test 1 in c1, test 2 in c2.
corr.out Simple plotting,
correlation, and straight-line regression
analyses of corr.dat.
corrres.lis Illustration of different types of
residual scores using corr.dat data.
See NWK text Chap 9 (esp Sec. 9.2).
predict.lis Illustration of PREDICT subcommand
(cf. MTB ver 7 manual 7-10,11) using corr.dat.

welfare.dat Children's Welfare in California.
Data collected by the Oakland-based
"Children Now" from government resources over the
past four years to comprise a "year-in-the-life"
composite index of children's welfare. Data are
presented on a county-by-county basis.
c1: County ranking on Welfare index
c2: Median family income
c3: Median family income ranking
welfare.lis Illustrates descriptive univariate analyses
(stem-and-leaf etc) and correlation and
regression analyses and plots.

coleman.dat Data from the Coleman report used
to illustrate multiple regression.
File coleman.dat contains data from a random
sample of 20 schools (from the East) from the
1966 Coleman Report.
The outcome measure C7 is the verbal mean test
score for all sixth graders in the school. The
predictor variables are: C2, staff salaries
per pupil, C3, percent white collar fathers for
the sixth graders; C4 is a SES composite measure
(deviation) for the sixth graders, C5 Mean
teacher's verbal test score, C6 6th grade mean
mother's educational level (1 unit=2 school yrs)

bodyfat.dat Data taken from NWK text,
Table 8.1. Measurement data in which 3
relatively inexpensive methods of assessment
are compared with the "gold standard" of
accurate measurement.
Description: "data for a study of the relation of
the amount of body fat to several possible
explanatory, independent variables, based on a
sample of 20 healthy females 25-34 years old.
The possible independent variables are triceps
skinfold thickness, thigh circumference, and
midarm circumference."
c1 has triceps, c2 has thigh, c3 has midarm, c4
has amount of body fat.
bodyfat.out Illustrates multiple regression
procedures in NWK text Sec. xx, and residual
diagnostics.

marks.log Uses marks.dat to illustrate
properties of multiple regression (and partial
correlation) coefficients and diagnostics for
same via adjusted variables approach.
marksnew.log Repeats, revises aspects of the
marks.log analyses to match partial regression
slopes and plots approach in NWK Section 11.1.

nels.dat Contains a subset of observations and variables
from the public release data tape for National
Educational Longitudinal Study of 1988 (NELS:88).
The National Center for Education Statistics
collected data from a representative sample of
8th-graders across the U.S. and followed these
students through grades 10 and 12. At each
grade, students took several achievement tests
and completed surveys that included questions
about their academic, family, and social lives.
The nels.dat data set contains students'
10th-grade scores on the science achievement
test, along with several variables that are
hypothesized to be good predictors of 10th-grade
science achievement.
Student ID is in C1 and 10th-grade science score
is in C2. Four achievement variables from 8th
grade are included: science, reading, math
knowledge, and math reasoning (C3-C6). The
math knowledge and math reasoning scores are
standardized (they have mean zero, variance
one). Indicator variables are included for
advanced "track" (i.e., high school program) and
general track; each student receives a 1 on the
variable if he or she is in that program and a
0 otherwise. Students in the academic track
receive 0's on both variables. These are found
in C7 and C8, respectively. In C9-C12 there are
indicator variables for courses taken - biology
or not in C9, chemistry or not in C10, earth
science or not in C11, and general science or
not in C12. C13 contains an indicator variable
for gender: 1 for males, 0 for females. In
C14-C16 are indicator variables for ethnicity:
Asian or not in C14, African-American or not
in C15, and Latino/Hispanic or not in C16.
Finally, C17 and C18 contain indicator variables
for socio-economic status: Lowest quartile or
not in C17 and highest quartile or not in C18.
grow.dat
Data from the Berkeley Growth Study
(Nancy Bayley). These data are for Child
#8 in the BGS study with age in months in c2
(ranging from 1 to 60) and intellectual
performance in C1.
grow.lis Fitting a score on age regression
for grow.dat, using polynomial regression.

SPRING QTR
dummy.log Single classification anova via
regression with dummy (group membership)
predictor variables. Uses smsg.dat and harr.dat.

dum2way.dat The response data in c1 are obtained from the
following 2x3 design. An experiment was
conducted to examine the effects of different
levels of reinforcement and different levels of
isolation on children's ability to recall. A
single analyst was to work with a random sample
of 30 children selected from a relatively
homogeneous group of fourth-grade students. Two
levels of reinforcement (none and verbal) and
three levels of isolation (20, 40, and 60
minutes) were to be used.
Students were randomly assigned to the six
treatment groups, with a total of six students
being assigned to each group. Each student was
to spend a 30-minute session with the analyst.
During this time the student was to memorize a
specific passage, with reinforcement provided
as dictated by the group to which the student
was assigned. Following the 30-minute session,
the student was isolated for the time specified
for his or her group and then tested for recall
of the memorized passage.
These data appear in the accompanying table.

                     Time of Isolation (Minutes)
Level of
Reinforcement      20         40         60

                 26  19     30  36      6  10
None             23  18     25  28     11  14
                 28  25     27  24     17  19

                 15  16     24  26     31  38
Verbal           24  22     29  27     29  34
                 25  21     23  21     35  30

Clearly, both factors are fixed factors. In this
data file the responses above are in c1 with row
(1,2) in c2 and column (1,2,3) in c3. In c10-c14
are the dummy (0,1) codings for the regression
version of a two-way anova.
dum2way.lis Constructs the dummy variables in
dum2way.dat. Carries out regression and GLM
analyses of the 2x3 fixed effects design.
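
A 2x3 design takes five 0/1 predictors for the regression version: one for rows, two for columns, and their two products for the interaction. The Python sketch below shows one such coding. Whether c10-c14 in dum2way.dat use this order and these reference levels (row 2, column 3) is an assumption; dum2way.lis shows the actual construction.

```python
def dummy_code(row, col):
    """Five 0/1 predictors for one observation in a 2x3 design.

    row is 1 or 2 and col is 1, 2 or 3. Row 2 and column 3 serve as
    reference levels (an assumed choice, not taken from dum2way.lis).
    """
    r1 = 1 if row == 1 else 0
    k1 = 1 if col == 1 else 0
    k2 = 1 if col == 2 else 0
    # Interaction predictors are products of the main-effect dummies.
    return [r1, k1, k2, r1 * k1, r1 * k2]


# One coded row per cell of the design.
cells = {(r, k): dummy_code(r, k) for r in (1, 2) for k in (1, 2, 3)}
```

With the intercept, the six cells get six distinct predictor patterns, so the full regression reproduces the six cell means.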

ancova.log
Illustration of 2-group, pre-post analysis of
covariance with data from smsg.dat. First the
multiple regression approach is shown,
followed by the MINITAB ancova routine
for comparison.

ancvdrug.dat Data taken from Ott's text
to illustrate a 2-group, pre-post design. The
description of these data is: "An investigator is
interested in comparing two drug products (A and
B) in overweight female volunteers. The
experiment calls for 20 randomly selected
subjects who are at least 25% overweight. Ten
of these women are to be randomly assigned to
product 1 and the remaining 10 to product 2.
The response of interest is a score on a rating
scale used to measure the mood of a subject. To
obtain a score, a subject must complete a
checklist indicating how each of 50 adjectives
describes her mood at that time.
On the study day, all 20 volunteers are required
to complete the checklist at 8 AM. Then each
subject is given the prescribed medication
(product 1 or 2). Each subject is required to
complete the checklist again at 10 AM." The 8 AM
score is in c1, the 10 AM score
in c2 and the group membership indicator
(1 = product 1; 0 = product 2) in c3.
ancvdrug.lis
Description of 2-group pre-post data in
ancvdrug.dat. Analysis of covariance is carried
out with multiple regression, dummy-variable
approach and then compared with MINITAB ancova
command.

huitema.dat
Three groups, each of size 10,
single outcome, 2 covariates. Taken from the
Huitema text with the description: "The
investigator is concerned with the effects of
three different types of study objectives on
student achievement in freshman biology. The
three types of objectives are:
1.General--students are told to know and
understand everything in the text.
2.Specific--students are provided with a clear
specification of the terms and concepts they are
expected to master and of the testing format.
3.Specific with study time allocations--the
amount of time that should be spent on each
topic is provided in addition to specific
objectives that describe the type
of behavior expected on examinations.
The dependent variable is the biology
achievement test.
A population of freshman students scheduled to
enroll in biology is defined, and 30 students
are randomly selected. The investigator obtains
aptitude test scores and scores from an academic
motivation test for all students before the
investigator randomly assigns 10 students to each
of the three treatments. Treatments are
administered, and scores on the dependent
variable are obtained for all students."
In the data file, the dependent variable is in
c1, aptitude test in c2, academic motivation in
c3, and group membership variable (1,2,3) in
c6. In c4-c5 are two 0,1 dummy variables that
define the group membership in c6.
huitema.lis Description of data in huitema.dat.
Carries out ancova for the 3-group two-covariate
design using MINITAB ancova and multiple
regression approach.

cnrl.dat
2-group data (10 cases per group)
with single outcome and single covariate taken
from Rogosa (1980). Outcome in c1, covariate
in c2, group membership (1,0) in c3.
cnrl.lis Description of cnrl.dat.
Carries out computations needed for Comparing
Nonparallel Regression Lines procedures.

nwkt12p1.dat Data from NWK text,
now Chapter 8, formerly Table 12.1.
"A hospital surgical unit was interested in
predicting survival in patients undergoing a
particular type of liver operation. A random
selection of 54 patients was available for
analysis. From each patient record, the
following information was extracted from the
preoperation evaluation: blood clotting score,
prognostic index, enzyme function test, liver
function test. The dependent variable is
survival time."
Blood clotting score is in c1, prognostic index
in c2, enzyme function test in c3, liver
function test in c4, survival time in c5 and
log10survival in c6.
stepw.lis Uses nwkt12p1.dat to illustrate
stepwise regression variable selection procedures
(Forward stepwise, Backward Elimination.)
Reproduces results in NWK.
breg.lis Uses nwkt12p1.dat to illustrate
"best subsets" variable selection procedure
(using breg in MINITAB). Reproduces results in
NWK.

pcamarks.dat
Data from 18 students in the
prior (many years ago) 2-quarter version of part
of this course (i.e. Education 250A,B). c1-c6
are the scores on the six graded homework
assignments; c7 has the final exam for 250A,
c8 has the midterm in 250B, and c9 has the
outcome score, the final exam in 250B.
pca257.lis Uses composite construction
and principal components (using MINITAB pca) to
examine data reduction procedures for the
predictors in pcamarks.dat.

Path Analysis First path analysis example from
lecture: 4 variables, SES IQ nAch GPA
Path Analysis Second path analysis example from
lecture: three longitudinal observations.

************** PART III ANALYSIS OF CATEGORICAL DATA ****************

Agresti Supplement Tables from the Appendix
of "An Introduction to Categorical Data Analysis,"
by Alan Agresti, published by John Wiley and Sons, Inc.,
January 1996. The tables show SAS code for the analyses
conducted in that text, and contain the major data sets
from that text.

Exact Confidence Interval for Proportion SAS implementation in PROC FREQ for
Exact Confidence Interval for Proportion, see Agresti Ch.1
(also Mathematica handout).
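
To show what an "exact" interval involves, here is a minimal Python sketch. It assumes the interval meant is the Clopper-Pearson construction, which inverts two one-sided binomial tail probabilities. It is not the SAS or Mathematica code from the course; the function names and the bisection solver are illustrative choices.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(0, x + 1))


def _bisect(f, lo=0.0, hi=1.0, tol=1e-12):
    """Root of an increasing function f on [lo, hi] with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson interval for a proportion, x successes in n trials."""
    alpha = 1.0 - conf
    # Lower limit solves P(X >= x | p) = alpha/2 (taken as 0 when x = 0).
    lower = 0.0 if x == 0 else _bisect(
        lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper limit solves P(X <= x | p) = alpha/2 (taken as 1 when x = n).
    upper = 1.0 if x == n else _bisect(
        lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper


# Hypothetical example: 3 successes in 10 trials.
lo3, hi3 = exact_ci(3, 10)
```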

Generalized Linear Models: Logistic and Poisson Regression

coupon.dat Dichotomous outcome, single
quantitative predictor (*with replication*).
From NWK supplement (or the NWK regression book),
the description is:
"In a study of the effectiveness of coupons
offering a price reduction on a given product,
1,000 homes were selected and a coupon and
advertising material for the product were mailed
to each. The coupons offered different price
reductions (5,10,15,20, and 30 cents), and 200
homes were assigned at random to each of the
price reduction categories. The independent
variable in this study is the amount of price
reduction, and the dependent variable is a binary
variable indicating whether or not the coupon
was redeemed within a six-month period."
The price reduction is in c1, number of
households (200) in c2, and number redeemed from
the 200 households in c3.
coupon.lis Logit transformation and
OLS and WLS fits to coupon.dat.
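
The logit transformation and the weighted fit can be sketched in Python as follows. This illustrates the general approach with replicated binary data (logit of each observed proportion, weights n*p*(1-p)) and does not reproduce coupon.lis. The function names are assumptions and the redemption counts are invented, not the coupon.dat values.

```python
from math import log


def empirical_logits(n, r):
    """Per-level proportion, logit, and WLS weight n*p*(1-p).

    n[j] is the number of trials and r[j] the number of successes at level j.
    Every proportion must lie strictly between 0 and 1.
    """
    out = []
    for nj, rj in zip(n, r):
        p = rj / nj
        if not 0.0 < p < 1.0:
            raise ValueError("logit undefined for a proportion of 0 or 1")
        out.append((p, log(p / (1.0 - p)), nj * p * (1.0 - p)))
    return out


def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


price_cut = [5, 10, 15, 20, 30]     # cents, from the description above
homes = [200] * 5                   # 200 homes per price reduction
redeemed = [20, 40, 60, 100, 140]   # INVENTED counts, not the coupon.dat values

rows = empirical_logits(homes, redeemed)
logits = [row[1] for row in rows]
weights = [row[2] for row in rows]
b0_ols, b1_ols = wls_line(price_cut, logits, [1.0] * 5)  # OLS: equal weights
b0_wls, b1_wls = wls_line(price_cut, logits, weights)
```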

program.dat Dichotomous outcome, single
quantitative predictor. From NWK supplement
(or the NWK regression book), the description is:
"A small-scale investigation was undertaken to
study the effect of computer programming
experience on ability to complete a complex
programming task, including debugging, within
a specified time.
Twenty-five persons were selected for the study.
They had varying amounts of programming
experience (measured in months of experience).
All persons were given the same programming task.
The results are coded in binary fashion; if the
task was completed successfully in the allotted
time, it was scored 1, and if the task was not
completed successfully, it was scored 0."
Months of experience are in c1, and the binary
outcome measure is in c2.
program.lis Plots and description
of program.dat. OLS and WLS fits of straight-line
functional form.
BMDPLR logistic regression fit (presented in
class) compared with straight-line fit.
NEW! Minitab blog binary logistic regression.

progsas.sas contains the SAS instructions to carry out
a logistic regression for the data in program.dat.
progsas.lst SAS output obtained from the command
line statement: "sas progsas" on an elaine.
Contains the logistic regression parameter
estimates and fits.

disease.dat Data set, 98 cases, shown
in NWK Table 14.3 and App.C.3. In disease.dat
C1 is Age, C2 and C3 the SES indicators (see p. 582),
C4 City Sector, C5 disease status.
diseaseselect.lis Comparison
for variable selection of various logistic
regression models for disease data following
NWK sec 14.5.

Poisson Regression Construction of Artificial Data
using Minitab and SAS analysis using PROC GENMOD

Contingency Tables and Log-linear Models

draft.cnt Draft lottery data from 1971. Rows are
months Jan-Dec; columns are the number of days
in each month at highest risk (lottery numbers
1-122) in C1, numbers 123-244 in C2, and lowest
risk (numbers 245-366) in C3.
draft.lis Chi-square test for independence
(fairness) for draft lottery data.

Aspirin and MI Data and SAS analysis for Aspirin Use
and Myocardial Infarction, Agresti Section 2.2.2

Lung Cancer Data and SAS analysis for Smoking and
Lung Cancer example

Tea Tasting Data and SAS analysis Fisher's Tea Tasting
example; Fisher's exact test, Agresti Section 2.6.1
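A small sketch of the null distribution behind Fisher's exact test, assuming the classic design of 8 cups (4 with milk poured first, 4 with tea poured first) in which the taster must pick the 4 milk-first cups. The function names are hypothetical. The observed table from the linked data file is not reproduced here.

```python
from math import comb

def tea_null_distribution(cups_per_type=4):
    """P(k milk-first cups identified correctly) under pure guessing.

    This is the hypergeometric distribution: the taster picks cups_per_type
    cups out of 2*cups_per_type, of which cups_per_type are truly milk-first.
    """
    c = cups_per_type
    total = comb(2 * c, c)
    return {k: comb(c, k) * comb(c, c - k) / total for k in range(c + 1)}

def one_sided_p(k_observed, cups_per_type=4):
    """Exact one-sided P-value: probability of k_observed or more correct."""
    dist = tea_null_distribution(cups_per_type)
    return sum(prob for k, prob in dist.items() if k >= k_observed)

for k, prob in tea_null_distribution().items():
    print(k, "correct:", round(prob, 4))
print("P(3 or more correct) =", round(one_sided_p(3), 4))  # 17/70
print("P(all 4 correct)     =", round(one_sided_p(4), 4))  # 1/70
```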

Bayes Rule and Conditional Probability: At-risk Students example

Matched Pairs, McNemar's test Data and SAS analysis using PROC FREQ
of matched pairs data. Example is approval rating
(approve/disapprove) data from 1600 individuals at
two times. See Agresti Ch. 9.

CMH analysis Ex Data and SAS analysis SAS file
for CMH analysis (Cochran-Mantel-Haenszel Statistics) for
meta-analysis of Chinese smoking data in Agresti Table 3.3

CMH analysis for Migraine Ex Data and SAS analysis
Cochran-Mantel-Haenszel Statistics for Migraine Ex.
2x2 factorial design--Gender by Treatment (Active, Placebo)
with binary outcome Improve (Better, Same).

Belief in Afterlife Ex Data and SAS GENMOD analysis loglinear model
for Agresti Ch. 2 2x2 Example.

Death Penalty Ex Cross-classification Tables for Death Penalty Data.
Illustration of Simpson's Paradox. 2x2x2 Table: Death Penalty
dp (yes/no); Defendant Race defr, Victim Race victr,
(race coded white=1, black=2)

PROC GENMOD code for Migraine Ex SAS run file for all partial and saturated
log-linear models for Migraine Ex.
2x2 factorial design--Gender by Treatment (Active, Placebo)
with binary outcome Improve (Better, Same).
PROC GENMOD output for Migraine Ex Resulting SAS output for all partial and
saturated log-linear models for Migraine Ex.
Selected models for Migraine Ex SAS code and output for selected log-linear models
for Migraine Ex. Subset of examples above.

--------------
Alcohol, Cigarette, and Marijuana Use Example
Agresti Table 6.3 A survey conducted in 1992 by the
Wright State University School of Medicine and the
United Health Services in Dayton, Ohio. Among other
things, the survey asked students in their final year
of high school in a nonurban area near Dayton, Ohio
whether they had ever used alcohol, cigarettes, or marijuana.
Denote the variables in this 2 X 2 X 2 table by A for alcohol use,
C for cigarette use, and M for marijuana use.
Table 6.3 Alcohol (A), Cigarette (C), and Marijuana (M) Use
for High School Seniors

                              Marijuana Use
Alcohol Use   Cigarette Use     Yes     No
--------------------------------------------
Yes           Yes               911    538
Yes           No                 44    456
No            Yes                 3     43
No            No                  2    279
--------------------------------------------

PROC GENMOD code for A C M Ex SAS run file for all partial and saturated
log-linear models for A C M Ex.
PROC GENMOD output for A C M Ex Resulting SAS output for all partial and
saturated log-linear models for A C M Ex.
Drugs AC AM CM SAS GENMOD analysis for best loglinear model
(AM, AC, CM) shown in Agresti Table 6.7
Drugs AM CM SAS analysis for (poor-fitting) loglinear model
(AM, CM) shown in Agresti Table 6.7
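A sketch, not the SAS GENMOD run, of how the two loglinear models above can be fitted to the Table 6.3 counts by iterative proportional fitting (IPF), a different technique that gives the same fitted values. The function names and the 0/1 coding (1 = yes, 0 = no) are hypothetical choices for this illustration.

```python
import math

# Table 6.3 counts keyed by (alcohol, cigarette, marijuana); 1 = yes, 0 = no
counts = {
    (1, 1, 1): 911, (1, 1, 0): 538, (1, 0, 1): 44, (1, 0, 0): 456,
    (0, 1, 1): 3,   (0, 1, 0): 43,  (0, 0, 1): 2,  (0, 0, 0): 279,
}
A, C, M = 0, 1, 2  # positions of the variables in each cell key

def margin(table, axes):
    """Sum the table over every variable not listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def ipf(table, margins, sweeps=200):
    """Fitted counts for the loglinear model whose sufficient margins are given."""
    fit = {cell: 1.0 for cell in table}
    targets = [margin(table, axes) for axes in margins]
    for _ in range(sweeps):
        for axes, target in zip(margins, targets):
            current = margin(fit, axes)
            for cell in fit:
                key = tuple(cell[i] for i in axes)
                fit[cell] *= target[key] / current[key]
    return fit

def deviance(table, fit):
    """Likelihood-ratio statistic G^2 = 2 * sum n * log(n / fitted)."""
    return 2.0 * sum(n * math.log(n / fit[cell]) for cell, n in table.items() if n > 0)

fit_ac_am_cm = ipf(counts, [(A, C), (A, M), (C, M)])  # homogeneous association
fit_am_cm = ipf(counts, [(A, M), (C, M)])             # A, C independent given M

print("G^2 (AC, AM, CM) =", round(deviance(counts, fit_ac_am_cm), 2), "on 1 df")
print("G^2 (AM, CM)     =", round(deviance(counts, fit_am_cm), 2), "on 2 df")
```

A small G^2 for (AC, AM, CM) and a very large one for (AM, CM) correspond to the "best" and "poor-fitting" labels in the entries above.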

Trend in 2xC tables Agresti section 2.5.2 Alcohol and Infant Malformation Example
Table 2.7 refers to a prospective study of maternal drinking and
congenital malformations. After the first three months of pregnancy,
the women in the sample completed a questionnaire about alcohol
consumption. Following childbirth, observations were recorded
on presence or absence of congenital sex organ malformations.
Table 2.7 Infant Malformation and Mother's Alcohol Consumption

Alcohol          Malformation                Percentage
Consumption    Absent   Present     Total       Present
0              17,066        48    17,114          0.28
<1             14,464        38    14,502          0.26
1-2               788         5       793          0.63
3-5               126         1       127          0.79
>=6                37         1        38          2.63
Source: B. I. Graubard and E. L. Korn, Biometrics 43:471-476 (1987).

SAS output illustrates Cochran-Armitage test for trend.
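A minimal sketch of the Cochran-Armitage trend statistic computed directly from the Table 2.7 counts. The row scores (0, 0.5, 1.5, 4.0, 7.0) are an assumption: midpoints of the consumption categories, with 7.0 chosen arbitrarily for the open-ended top category. Other scores give different results. The function name is hypothetical, and this is not the SAS output itself.

```python
import math

absent = [17066, 14464, 788, 126, 37]    # Table 2.7, malformation absent
present = [48, 38, 5, 1, 1]              # Table 2.7, malformation present
scores = [0.0, 0.5, 1.5, 4.0, 7.0]       # ASSUMED row scores (category midpoints)

def cochran_armitage(successes, totals, scores):
    """Return (z, two-sided P) for a linear trend in binomial proportions."""
    n = sum(totals)
    p = sum(successes) / n                      # pooled proportion
    xbar = sum(t * x for t, x in zip(totals, scores)) / n
    # Score-weighted departure of observed from expected successes
    t_stat = sum(x * (y - t * p) for x, y, t in zip(scores, successes, totals))
    var = p * (1 - p) * sum(t * (x - xbar) ** 2 for t, x in zip(totals, scores))
    z = t_stat / math.sqrt(var)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

totals = [a + b for a, b in zip(absent, present)]
z, p_value = cochran_armitage(present, totals, scores)
print("z =", round(z, 3), " z^2 =", round(z * z, 2), " P =", round(p_value, 4))
```

The squared statistic is referred to a chi-square distribution with 1 df, so it uses the ordering of the consumption categories that the ordinary chi-square test of independence ignores.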

Linear Association Models for Ordinal Data Agresti, Chapter 7. Data from
the 1991 General Social Survey illustrate the
inadequacy of ordinary loglinear models for analyzing
ordinal data. Subjects were asked their opinion about a man
and woman having sex relations before marriage, with possible
responses "always wrong," "almost always wrong," "wrong only sometimes,"
and "not wrong at all." They were also asked if they "strongly disagree,"
"disagree," "agree," or "strongly agree" that methods of birth control
should be made available to teenagers between the ages of 14 and 16.
Both classifications have ordered categories.
SAS analysis using PROC GENMOD compares the independence model and
the linear association model.