(Note: Additional examples not in electronic form will be introduced throughout the course during lectures. Also, additional data sets and output in electronic form for homework assignments, solutions, etcetera will be described in those documents and posted to the course directory at the appropriate point in the course.)
NAME DESCRIPTION
mlapair.dat Paired pre-test post-test data example.
Story from textbook (MM p. 517), Example 8.3.
"The National Endowment for the Humanities
sponsors summer institutes to improve the skills
of high school teachers of foreign languages. One such
institute hosted 20 French teachers for 4 weeks. At
the beginning of the period, the teachers were given
the Modern Language Association's (MLA) listening test
of understanding of spoken French. After 4 weeks of
immersion in French in and out of class, the listening
test was given again. (The actual spoken French in
the two tests was different, so that taking the first
test should not improve the score on the second test.)
The maximum possible score on the test is 36."
mlapair.lis Analysis of Paired pre-test post-test data
example using Minitab.
mlasign.lis Nonparametric analysis of Paired pre-test
post-test data via sign test procedures using Minitab.
smsg.dat Used in Part I (and in analysis of covariance).
Data from a mathematics curriculum evaluation,
circa 1961. Purpose of the large scale study was
to compare mathematics achievement in a
traditional ninth-grade algebra course with
that in an alternative course developed by the
School Mathematics Study Group (SMSG). 43
teachers from schools across the US
participated; by random assignment there were
21 SMSG (new math) classrooms and 22 traditional
math classrooms.
Columns c1 and c3 contain group indicator
variables; c3 = 1 is SMSG classroom and c3 = 0
is traditional.
The post-instruction outcome measure (classroom
average) on math achievement given at the end of
the school year is in c2; this test was a
traditional algebra test published by the
Cooperative Test division of Educational Testing
Service.
In c4 is a pre-instruction ("pre-test") measure
of knowledge of number systems.
smsg.lis Used in Part I review. Descriptive and
inferential two-group comparisons for the outcome
measure (c2) in smsg.dat.
drptwot.dat Two group comparison example.
Story from textbook (MM p. 542), Example 8.8.
"An educator believes that new directed reading activities
in the classroom will help elementary school pupils
improve some aspects of their reading ability. She
arranges for a third-grade class of 21 students to follow
these activities for an 8-week period. A control classroom
of 23 third graders follows the same curriculum without
the activities. At the end of the 8 weeks, all students are given a
Degree of Reading Power (DRP) test, which measures the
aspects of reading ability that the treatment is designed to
improve." Data are in unstacked form with treatment in C1 and
control in C2.
drptwot.lis Two-sample analysis of the two-group comparison data
example (drptwot.dat) using Minitab.
alphatot.tab Tabulation of total error rate
probabilities for c inferences each done at level
alph: tot = 1 - (1 - alph)^c is solved for
alph. Mathematica script appended.
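A minimal sketch of the calculation behind this table, written in Python rather than the Mathematica script appended to the file. Solving tot = 1 - (1 - alph)^c for alph gives alph = 1 - (1 - tot)^(1/c). The function names are illustrative only and do not come from alphatot.tab; the formula itself treats the c inferences as independent.

```python
def per_inference_level(tot: float, c: int) -> float:
    """Level alph for each of c inferences so that the total error
    rate is tot, from tot = 1 - (1 - alph)**c."""
    if not 0.0 < tot < 1.0:
        raise ValueError("tot must be strictly between 0 and 1")
    if c < 1:
        raise ValueError("c must be a positive integer")
    return 1.0 - (1.0 - tot) ** (1.0 / c)


def total_error_rate(alph: float, c: int) -> float:
    """Total error rate for c inferences each done at level alph."""
    return 1.0 - (1.0 - alph) ** c


# Small tabulation for a total error rate of 0.05.
for c in (1, 2, 5, 10):
    print(c, round(per_inference_level(0.05, c), 5))
```

The resulting per-inference level is always slightly larger than the simple tot/c division, which is why the two are sometimes tabulated side by side.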
harr.dat Data obtained from the
Hopkins & Glass textbook. Their description is
"Harrington (1968) experimented with the order
of 'mental organizers' that structure the
material for the learner. A group of 30 persons
were randomly split into three groups of 10 each.
Group I received organizing material before
studying instructional material on mathematics;
Group II received the 'organizer' after studying
the mathematics; Group III received the math
materials but no organizing materials. Scores
are from a 10-item mathematics test on the
instructional content."
The data are in "unstacked" form in c1-c3.
harr.lis One-way anova (MINITAB) on harr.dat.
harr1v.out BMDP1V output for harr.dat using
orthogonal contrasts.
hartukey.lis Minitab implementation of Tukey post-hoc
comparison procedures with the harr.dat data.
ibs.dat Used in Part I.A.1.
These are waiting-time data under
three different protocols. Data are in stacked
form. The actual data are from Ott's text and
are described as follows:
"Irritable bowel syndrome (IBS) is a non-
specific intestinal disorder characterized by
abdominal pain and irregular bowel habits. Each
person in a random sample of 24 patients having
periodic attacks of IBS was randomly assigned to
one of three treatment groups. The number of
hours of relief while on therapy is recorded
for each patient." Outcome in c1, group
indicator in c2.
ibsbmd7d.log Part I.A.1. BMDP7D output for ibs.dat. Implements
Levene's test. Implements two versions of
one-way anova (Welch, Brown-Forsythe) that do
not assume equal within-group variances.
ibslev.lis Part I.A.1. Gives description of ibs.dat;
implements in Minitab two forms of Levene's
test for equal within-group variances.
ibstrans.log Part I.A.1. Carries out (in MINITAB) natural log
transformation of ibs.dat outcome to stabilize
variance. Compares anova on raw and transformed
data.
clergy.lis Part I.A.4. Illustration of Kruskal-Wallis test
(in MINITAB), non-parametric alternative to
one-way anova. Comparison with standard anova
on ranked data.
Data taken from Ott text: "Three random samples
of clergymen were drawn: one containing 10
Methodist ministers, the second containing 10
Catholic priests, the third containing 10
Pentecostal ministers. Each of the clergymen
was examined with a test to measure
his knowledge about causes of mental illness."
bakery.dat 3 x 2 fixed effects with 2 replications per cell.
The Castle Bakery Company
supplies wrapped Italian bread to a large number
of supermarkets in a metropolitan area. An
experimental study was made of the effects of
height of the shelf display (factor A: bottom,
middle, top in c2) and the width of the shelf
display (factor B: regular, wide in c3) on sales
of this bakery’s bread during the experimental
period (c1, measured in cases). Twelve supermarkets,
similar in terms of sales volume and clientele, were
utilized in the study. The six treatments were
assigned at random to two stores each according to
a completely randomized design, and the display of
the bread in each store followed the treatment
specifications for that store. Sales of the bread
were recorded, and these results are presented in
bakery.dat.
bakery.lis Table of cell means and two-way fixed effects anova
for bakery.dat.
integ.dat 2 x 2 fixed effects with 50
replications per cell. Data obtained from early
Minitab Handbook which gives the following
description: "A researcher at Columbia
University was interested in the effect of
school integration on racial attitudes. He
gave an 'ethnocentrism' test to four groups of
children: black children in a segregated school,
white children in a segregated school, black
children in an integrated school, and white
children in an integrated school. 'Ethnocentrism'
is defined as the tendency of children to prefer
to associate with, and respect, other children
of the same ethnic group to those of another
ethnic group. Thus, students who score high on
this test have a stronger preference for their
own race." The data are in stacked form,
with the test score in c1, schooltype in c2
(1 = integrated, 2 = segregated) and race in c3
(1 = black, 2 = caucasian).
integ.lis Cell means and anova table
(from MINITAB) for integ.dat.
scitest.dat Data collected as part of study designed
to investigate the feasibility and technical
quality of science performance assessments. Two
tasks, called Radiation and Rate of Cooling,
were developed from a common "task shell";
in other words, they were designed to be as
parallel as possible in the science processes
tested and in the format of stimulus materials
and required response. They can be thought of
as two sample tasks from a "universe" of similar,
parallel tasks. The investigators treat task as
a random factor because they could imagine
creating additional tasks out of the task shell
from which these two came. This data set
contains the scores of thirty students, assumed
to be drawn at random from the population of
students, each tested on both tasks. Three
raters scored the responses; each paper was
scored by two of the three raters. The students
come from three different schools, ten from each.
Scores are in C1, student ID is in C2, task (1
for Radiation, 2 for Rate of Cooling) is in C3,
rater is in C4, and school is in C5.
scitest.lis Minitab output from a 2-way random effects anova
with outcome the score on the science test, with
the two random factors being student and task.
So the design is 30x2 with 2 replications per
cell.
sunburn.dat Two-way mixed example; taken from Sunscreen ex.
Ott p.770
A corporation is interested in comparing two
different sunscreens (s1 and s2). A random
sample of 10 females (ages 20-25 years)
participated in the study. For each person two
1" x 1" squares were marked off on either side
of the back, under the shoulder but above the
small of the back. Sunscreen s1 was randomly
assigned to the two squares on one side of the
back, with s2 on the other two squares. Exposure
to the sun was for a two-hour period.
The outcome was change (postexposure minus
preexposure) in a reading based on the color of
skin in a square. So we have 10 levels of the
random column factor subjects, two levels of the
fixed row factor, sunscreen, and two replications
per cell. In file sunburn.dat we have the
outcome measure in c1, the type of sunscreen
(s1 = 1, s2 = 2) in c2, the person (i.e. female
tanning subject) in c3.
sunburn.lis Minitab output for the
mixed model analysis of the sunburn.dat data,
a 2 x 10 design with 2 replications per cell.
unbalanc.dat
Data for a 2 x 3 fixed effects
design, having between 1 and 3 replications per
cell. The data are shown and described in Table
20.1 and section 20.2 of NWK text. The first
part of this data file has the outcome measure
(growth rate in response to therapy) in c4, the
row factor (subject gender 1,2) in c1, the column
factor (degree of depressed development;
severe = 1, moderate = 2, mild = 3) in c2, and
the replication indicator in c3.
This data structure is set up for the GLM
approach to the analysis of unbalanced designs.
The second part of the data file is set up for
the application of the approximate analysis based
on cell means; cell means in c1, row factor in
c2, column factor in c3.
unbalanc.log
Analyses of the data in unbalanc.dat.
First is shown the GLM analysis (cf. MTB version
7 manual p. 8-27). Second the approximate cell
means analysis is constructed and then compared
with GLM results.
stress.dat Data are from a 2x2x2 fixed
effects design with 3 replications per cell.
Data are shown in Table 22.2 and described in
Section 22.2 of NWK text. The outcome
measure is exercise tolerance from a stress
test in c1, with gender (male = 1, female = 2)
in c2, body fat level (low = 1, high = 2) in c3
and smoking history (light = 1, heavy = 2) in c4.
stress.lis Analysis of the 3-way design from
stress.dat. Description using versions of
MINITAB Table command along with Layout
subcommand (cf. MTB version 7 manual pages
11-9,11-12). Three-way analysis of variance
using anova command.
*************************** PART II ********************************************
CORRELATION and REGRESSION
corr.dat 28 bivariate observations,
test 1 in c1, test 2 in c2.
corr.out Simple plotting,
correlation, and straight-line regression
analyses of corr.dat.
corrres.lis Illustration of different types of
residual scores using corr.dat data.
See NWK text Chap 9 (esp Sec. 9.2).
predict.lis Illustration of PREDICT subcommand
(cf. MTB ver 7 manual 7-10,11) using corr.dat.
welfare.dat Children's Welfare in California.
Data collected by the Oakland-based
"Children Now" from government resources over the
past four years to comprise a "year-in-the-life"
composite index of children's welfare. Data are
presented on a county-by-county basis.
c1: County ranking on Welfare index
c2: Median family income
c3: Median family income ranking
welfare.lis Illustrates descriptive univariate analyses
(stem-and-leaf etc) and correlation and
regression analyses and plots.
coleman.dat Data from the Coleman report used
to illustrate multiple regression.
File coleman.dat contains data from a random
sample of 20 schools (from the East) from the
1966 Coleman Report.
The outcome measure C7 is the verbal mean test
score for all sixth graders in the school. The
predictor variables are: C2, staff salaries
per pupil; C3, percent white-collar fathers for
the sixth graders; C4, an SES composite measure
(deviation) for the sixth graders; C5, mean
teacher's verbal test score; and C6, sixth-grade mean
mother's educational level (1 unit = 2 school yrs).
bodyfat.dat Data taken from NWK text,
Table 8.1. Measurement data in which 3
relatively inexpensive methods of assessment
are compared with the "gold standard" of
accurate measurement.
Description: "data for a study of the relation of
the amount of body fat to several possible
explanatory, independent variables, based on a
sample of 20 healthy females 25-34 years old.
The possible independent variables are triceps
skinfold thickness, thigh circumference, and
midarm circumference."
c1 has triceps, c2 has thigh, c3 has midarm, c4
has amount of body fat.
bodyfat.out Illustrates multiple regression
procedures in NWK text Sec. xx, and residual
diagnostics.
marks.dat Used in Part II. Data from 17 students in
a prior (many years ago) 2-qtr version of part
of this course (i.e. Education 250A,B). c2 has
the sum of the scores on the six graded homework
assignments; c1 has the final exam for 250A, c3
has the midterm in 250B, and c4 has the outcome
score, the final exam in 250B.
marks.log Uses marks.dat to illustrate
properties of multiple regression (and partial
correlation) coefficients and diagnostics for
same via adjusted variables approach.
marksnew.log Repeats, revises aspects of the
marks.log analyses to match partial regression
slopes and plots approach in NWK Section 11.1.
nels.dat Contains a subset of observations and variables
from the public release data tape for National
Educational Longitudinal Study of 1988 (NELS:88).
The National Center for Education Statistics
collected data from a representative sample of
8th-graders across the U.S. and followed these
students through grades 10 and 12. At each
grade, students took several achievement tests
and completed surveys that included questions
about their academic, family, and social lives.
The nels.dat data set contains students'
10th-grade scores on the science achievement
test, along with several variables that are
hypothesized to be good predictors of 10th-grade
science achievement.
Student ID is in C1 and 10th-grade science score
is in C2. Four achievement variables from 8th
grade are included: science, reading, math
knowledge, and math reasoning (C3-C6). The
math knowledge and math reasoning scores are
standardized (they have mean zero, variance
one). Indicator variables are included for
advanced "track" (i.e., high school program) and
general track; each student receives a 1 on the
variable if he or she is in that program and a
0 otherwise. Students in the academic track
receive 0's on both variables. These are found
in C7 and C8, respectively. In C9-C12 there are
indicator variables for courses taken - biology
or not in C9, chemistry or not in C10, earth
science or not in C11, and general science or
not in C12. C13 contains an indicator variable
for gender: 1 for males, 0 for females. In
C14-C16 are indicator variables for ethnicity:
Asian or not in C14, African-American or not
in C15, and Latino/Hispanic or not in C16.
Finally, C17 and C18 contain indicator variables
for socio-economic status: Lowest quartile or
not in C17 and highest quartile or not in C18.
grow.dat
Data from the Berkeley Growth Study
(Nancy Bayley). These data are for Child
#8 in the BGS study with age in months in c2
(ranging from 1 to 60) and intellectual
performance in C1.
grow.lis Fitting a score on age regression
for grow.dat, using polynomial regression.
dummy.log Single classification anova via
regression with dummy (group membership)
predictor variables. Uses smsg.dat and harr.dat.
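A hedged sketch of the idea in the two-group case: regressing the outcome on a 0/1 group indicator gives an intercept equal to the mean of the group coded 0 and a slope equal to the difference in group means. The scores below are made up for illustration and are not taken from smsg.dat or harr.dat; the helper name ols_line is likewise hypothetical.

```python
from statistics import mean


def ols_line(x, y):
    """Least-squares intercept and slope for y regressed on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# MADE-UP data: indicator is 1 for one group, 0 for the other.
group = [0, 0, 0, 1, 1, 1]
score = [10.0, 12.0, 14.0, 15.0, 17.0, 19.0]

intercept, slope = ols_line(group, score)
mean0 = mean(s for g, s in zip(group, score) if g == 0)
mean1 = mean(s for g, s in zip(group, score) if g == 1)
print("intercept:", intercept, " mean of group 0:", mean0)
print("slope:", slope, " difference in means:", mean1 - mean0)
```

With more than two groups the same logic applies with one indicator per non-reference group, which requires a multiple regression fit rather than the single-predictor fit shown here.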
ancova.log
Illustration of 2-group, pre-post analysis of
covariance with data from smsg.dat. First the
multiple regression approach is shown,
followed by the MINITAB ancova routine
for comparison.
ancvdrug.dat Data taken from Ott's text
to illustrate a 2-group, pre-post design. The
description of these data is: "An investigator is
interested in comparing two drug products (A and
B) in overweight female volunteers. The
experiment calls for 20 randomly selected
subjects who are at least 25% overweight. Ten
of these women are to be randomly assigned to
product 1 and the remaining 10 to product 2.
The response of interest is a score on a rating
scale used to measure the mood of a subject. To
obtain a score, a subject must complete a
checklist indicating how each of 50 adjectives
describes her mood at that time.
On the study day, all 20 volunteers are required
to complete the checklist at 8 AM. Then each
subject is given the prescribed medication
(product 1 or 2). Each subject is required to
complete the checklist again at 10 AM." The 8 AM
score is in c1, the 10 AM score
in c2 and the group membership indicator
(1 = product 1; 0 = product 2) in c3.
ancvdrug.lis
Description of 2-group pre-post data in
ancvdrug.dat. Analysis of covariance is carried
out with multiple regression, dummy-variable
approach and then compared with MINITAB ancova
command.
huitema.dat
Three groups, each of size 10,
single outcome, 2 covariates. Taken from the
Huitema text with the description: "The
investigator is concerned with the effects of
three different types of study objectives on
student achievement in freshman biology. The
three types of objectives are:
1.General--students are told to know and
understand everything in the text.
2.Specific--students are provided with a clear
specification of the terms and concepts they are
expected to master and of the testing format.
3.Specific with study time allocations--the
amount of time that should be spent on each
topic is provided in addition to specific
objectives that describe the type
of behavior expected on examinations.
The dependent variable is the biology
achievement test.
A population of freshman students scheduled to
enroll in biology is defined, and 30 students
are randomly selected. The investigator obtains
aptitude test scores and scores from an academic
motivation test for all students before the
investigator randomly assigns 10 students to each
of the three treatments. Treatments are
administered, and scores on the dependent
variable are obtained for all students."
In the data file, the dependent variable is in
c1, aptitude test in c2, academic motivation in
c3, and group membership variable (1,2,3) in
c6. In c4-c5 are two 0,1 dummy variables that
define the group membership in c6.
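As an illustration of how two 0,1 dummy variables can encode a three-level group variable, here is a small Python sketch. Which group serves as the reference category in huitema.dat is not stated in this description, so the choice of group 3 below is an assumption, and the function name dummy_code is hypothetical.

```python
def dummy_code(groups, reference):
    """Return one row of 0/1 indicators per observation, with one
    indicator for every group level except the reference level."""
    levels = sorted(set(groups))
    if reference not in levels:
        raise ValueError("reference must be one of the observed levels")
    kept = [lev for lev in levels if lev != reference]
    return [[1 if g == lev else 0 for lev in kept] for g in groups]


# Group membership coded 1, 2, 3 as in c6; group 3 ASSUMED as reference.
membership = [1, 2, 3, 1]
for g, row in zip(membership, dummy_code(membership, reference=3)):
    print(g, row)
```

Observations in the reference group receive 0 on both indicators, so the three groups are distinguished by the patterns (1,0), (0,1) and (0,0).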
huitema.lis Description of data in huitema.dat.
Carries out ancova for the 3-group two-covariate
design using MINITAB ancova and multiple
regression approach.
*************************** PART III ********************************************
BINARY and CATEGORICAL DATA
Binomial Distribution examples.
binchina.lis You've just entered a class in
ancient Chinese literature. You haven't even
learned the alphabet yet but they've given you
a pop quiz. You'll have to guess on every question.
It's a multiple choice test, with each of the 20
questions having three possible answers. To pass, you
must get at least 12 correct. What are the chances
you'll pass?
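A short Python sketch of the calculation this question calls for, assuming pure guessing means each of the 20 answers is independently correct with probability 1/3. The passing probability is then the upper tail P(X >= 12) of a Binomial(20, 1/3) distribution. The helper names are illustrative and are not taken from binchina.lis.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))


# 20 questions, 3 choices each, at least 12 correct needed to pass.
p_pass = binom_upper_tail(12, 20, 1 / 3)
print(f"P(at least 12 correct by guessing) = {p_pass:.4f}")
```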
binfreet.lis Rick is a basketball player
who makes 75 percent of his free throws over the
course of a season. In a key game Rick shoots 12 free
throws and misses 5 of them. The fans think he failed
because he was nervous. Is it unusual for Rick to
perform this poorly?
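One way to frame this question, sketched in Python: if the 12 shots are treated as independent with a constant 0.75 chance of success (an assumption, and the one the fans' "nervous" theory disputes), then missing 5 means making 7, and "this poorly" corresponds to the lower tail P(X <= 7) for X ~ Binomial(12, 0.75). The function name is illustrative and is not taken from binfreet.lis.

```python
from math import comb


def binom_cdf(k_max, n, p):
    """P(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_max + 1))


n_shots, p_make = 12, 0.75
makes = n_shots - 5          # 5 misses out of 12 shots

p_this_bad_or_worse = binom_cdf(makes, n_shots, p_make)
print(f"P({makes} or fewer makes in {n_shots} shots) = "
      f"{p_this_bad_or_worse:.4f}")
```

A probability that is not small suggests the performance is within the range of ordinary chance variation for a 75 percent shooter.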
binnorm.lis Illustrations of normal
approximations to the binomial.
binsign.lis Sign test example from
GH section 9.11; use of binomial probability.
Poisson Distribution examples.
poisson.lis Illustration of Poisson distribution
and binomial approximations for rare events.
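A rough sketch of the rare-event idea, assuming the entry refers to approximating binomial probabilities by Poisson probabilities with mean n*p when n is large and p is small. The values n = 1000 and p = 0.002 are hypothetical and chosen only for illustration; they are not taken from poisson.lis.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


# HYPOTHETICAL rare-event setting: many trials, small success chance.
n, p = 1000, 0.002
lam = n * p

max_gap = 0.0
for k in range(6):
    b, q = binom_pmf(k, n, p), poisson_pmf(k, lam)
    max_gap = max(max_gap, abs(b - q))
    print(k, round(b, 5), round(q, 5))
print("largest difference for k = 0..5:", round(max_gap, 5))
```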
draft.cnt Draft lottery data from 1971. Rows are
months Jan-Dec; columns give the number of days in
each month with lottery numbers 1-122 (highest risk)
in C1, numbers 123-244 in C2, and numbers 245-366
(lowest risk) in C3.
draft.lis Chi-square test for independence
(fairness) for draft lottery data.
teacher1.dat Part III. Source: U.S. Department of Education,
National Center for Education Statistics,
1987-1988 Schools and Staffing Survey.
Data: Willingness to become a teacher again
for Elementary and Secondary school teachers.
(Data + Output). This example illustrates
cross-classified categorical data, 2x5 table
and chi-square test.
teacher2.dat Part III. 1987-1988 Schools and Staffing Survey
Data:
Gender distribution for teachers in Elementary
and Secondary schools. (Data + Output)
Illustrates 2x2 table and chi-square test.
Agresti Supplement Tables from the Appendix
of "An Introduction to Categorical Data Analysis,"
by Alan Agresti, published by John Wiley and Sons, Inc.,
January 1996. The tables show SAS code for the analyses
conducted in that text, and contain the major data sets
from that text.
Aspirin and MI Data and SAS analysis for Aspirin Use
and Myocardial Infarction, Agresti Section 2.2.2
Lung Cancer Data and SAS analysis for Smoking and
Lung Cancer example
Tea Tasting Data and SAS analysis for Fisher's Tea Tasting
example; Fisher's Exact test, Agresti Section 2.6.1
program.dat Dichotomous outcome, single
quantitative predictor. From NWK supplement
(or the NWK regression book), the description is:
"A small-scale investigation was undertaken to
study the effect of computer programming
experience on ability to complete a complex
programming task, including debugging, within
a specified time.
Twenty-five persons were selected for the study.
They had varying amounts of programming
experience (measured in months of experience).
All persons were given the same programming task.
The results are coded in binary fashion; if the
task was completed successfully in the allotted
time, it was scored 1, and if the task was not
completed successfully, it was scored 0."
Months of experience are in c1, and the binary
outcome measure is in c2.
program.lis Plots and description
of program.dat. OLS and WLS fits of straight-line
functional form.
BMDPLR logistic regression fit (presented in
class) compared with straight-line fit.
NEW! Minitab blog binary logistic regression.
progsas.sas contains the SAS instructions to carry out
a logistic regression for the data in program.dat.
progsas.lst SAS output obtained from the command
line statement: "sas progsas" on an elaine.
Contains the logistic regression parameter
estimates and fits.
coupon.dat Dichotomous outcome, single
quantitative predictor (*with replication*).
From NWK supplement (or the NWK regression book),
the description is:
"In a study of the effectiveness of coupons
offering a price reduction on a given product,
1,000 homes were selected and a coupon and
advertising material for the product were mailed
to each. The coupons offered different price
reductions (5,10,15,20, and 30 cents), and 200
homes were assigned at random to each of the
price reduction categories. The independent
variable in this study is the amount of price
reduction, and the dependent variable is a binary
variable indicating whether or not the coupon
was redeemed within a six-month period."
The price reduction is in c1, number of
households (200) in c2, and number redeemed from
the 200 households in c3.
coupon.lis Logit transformation and
OLS and WLS fits to coupon.dat.
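A hedged Python sketch of the steps named here: transform each observed redemption proportion p to its logit log(p/(1-p)), then fit a straight line in price reduction by ordinary and by weighted least squares. The weights n*p*(1-p) are the usual approximate inverse variances of the empirical logits and are an assumption about what coupon.lis does. The redemption counts below are MADE-UP placeholders, not the values in coupon.dat, and the function names are hypothetical.

```python
from math import log


def empirical_logit(successes, n):
    """Logit of the observed proportion successes / n."""
    p = successes / n
    return log(p / (1 - p))


def weighted_line_fit(x, y, w):
    """Weighted least-squares intercept and slope of y on x.
    Equal weights give the ordinary least-squares fit."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


price = [5, 10, 15, 20, 30]        # price reductions from the description
n_homes = 200                      # homes per price category
redeemed = [20, 40, 60, 90, 130]   # MADE-UP counts, for illustration only

logits = [empirical_logit(r, n_homes) for r in redeemed]
weights = [n_homes * (r / n_homes) * (1 - r / n_homes) for r in redeemed]

print("OLS (intercept, slope):",
      weighted_line_fit(price, logits, [1.0] * len(price)))
print("WLS (intercept, slope):",
      weighted_line_fit(price, logits, weights))
```

The weighting gives less influence to categories whose proportions are near 0 or 1, where the empirical logit is most variable.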