(Note: Additional examples not in electronic form will be introduced throughout the course during lectures. Also, additional data sets and output in electronic form for homework assignments, solutions, etcetera will be described in those documents and posted to the course directory at the appropriate point in the course.)
NAME DESCRIPTION

alphatot.tab
Tabulation of total error rate probabilities for c inferences each done at level alph: tot = 1 - (1 - alph)^c is solved for alph. Mathematica script appended.

counsel.dat
A 3 x 10 mixed model design with 2 replications per cell. Fixed factor in c1 has 3 levels, random factor in c2 has 10 levels. The fixed factor is 3 different methods/strategies of counseling, and the random factor represents 10 counselors sampled from a population of counselors. Six clients from each counselor are divided amongst the 3 counseling strategies. The outcome measure in c3 is a self-report of neurotic symptoms. The mixed model analysis (using MINITAB) is described in lecture.

harr.dat
Data obtained from the Hopkins & Glass textbook. Their description is "Harrington (1968) experimented with the order of 'mental organizers' that structure the material for the learner. A group of 30 persons were randomly split into three groups of 10 each. Group I received organizing material before studying instructional material on mathematics; Group II received the 'organizer' after studying the mathematics; Group III received the math materials but no organizing materials. Scores are from a 10-item mathematics test on the instructional content." The data are in "unstacked" form in c1-c3.

harr.lis
One-way anova (MINITAB) on harr.dat.

harr1v.out
BMDP1V output for harr.dat using orthogonal contrasts.

hartukey.lis
Minitab implementation of Tukey post-hoc comparison procedures with the harr.dat data.

integ.dat
2 x 2 fixed effects with 50 replications per cell. Data obtained from an early Minitab Handbook, which gives the following description: "A researcher at Columbia University was interested in the effect of school integration on racial attitudes. He gave an "ethnocentrism" test to four groups of children: black children in a segregated school, white children in a segregated school, black children in an integrated school, and white children in an integrated school.
'Ethnocentrism' is defined as the tendency of children to prefer to associate with, and respect, other children of the same ethnic group to those of another ethnic group. Thus, students who score high on this test have a stronger preference for their own race." The data are in stacked form, with the test score in c1, schooltype in c2 (1 = integrated, 2 = segregated) and race in c3 (1 = black, 2 = caucasian).

integ.lis
Cell means and anova table (from MINITAB) for integ.dat.

rand2way.dat
The data are from a 3 x 3 design with 2 replications per cell. A classic measurement study design, as in generalizability theory G-studies. The actual data are from Ott's text, with the following (appetizing) description: "Consider an experiment to examine the effects of different analysts and subjects in chemical analyses for the DNA content of plaque. Three female subjects (ages 18-20 years) were chosen for the study. Each subject was allowed to maintain her usual diet, supplemented with 30 mg of sucrose per day. No toothbrushing or mouthwashing was allowed during the study. At the end of the week, plaque was scraped from the entire dentition of each subject and divided into six samples. Each of the analysts chosen at random made a DNA concentration determination on two samples for each subject. Data are in units of 10 micrograms." The DNA concentrations are in c1, analysts in c2 (1,2,3), subjects in c3 (1,2,3).

rand2way.lis
Analysis of rand2way.dat using MINITAB. Table of cell means, random effects anova including variance components estimation.

scitest.dat
Data collected as part of a study designed to investigate the feasibility and technical quality of science performance assessments. Two tasks, called Radiation and Rate of Cooling, were developed from a common "task shell"; in other words, they were designed to be as parallel as possible in the science processes tested and in the format of stimulus materials and required response.
They can be thought of as two sample tasks from a "universe" of similar, parallel tasks. The investigators treat task as a random factor because they could imagine creating additional tasks out of the task shell from which these two came. This data set contains the scores of thirty students, assumed to be drawn at random from the population of students, each tested on both tasks. Three raters scored the responses; each paper was scored by two of the three raters. The students come from three different schools, ten from each. Scores are in C1, student ID is in C2, task (1 for Radiation, 2 for Rate of Cooling) is in C3, rater is in C4, and school is in C5.

scitest.lis
Minitab output from a 2-way random effects anova with outcome the score on the science test, with the two random factors being student and task. So the design is 30x2 with 2 replications per cell.

smsg.dat
(Used in Part I review and analysis of covariance.) Data from a mathematics curriculum evaluation, circa 1961. The purpose of the large-scale study was to compare mathematics achievement in a traditional ninth-grade algebra course with that in an alternative course developed by the School Mathematics Study Group (SMSG). 43 teachers from schools across the US participated; by random assignment there were 21 SMSG (new math) classrooms with 22 traditional math classrooms. Columns c1 and c3 contain group indicator variables; c3 = 1 is SMSG classroom and c3 = 0 is traditional. The post-instruction outcome measure (classroom average) on math achievement given at the end of the school year is in c2; this test was a traditional algebra test published by the Cooperative Test division of Educational Testing Service. In c4 is a pre-instruction ("pre-test") measure of knowledge of number systems.

smsg.lis
Used in Part I review. Descriptive and inferential two-group comparisons for the outcome measure (c2) in smsg.dat.

sunburn.dat
Two-way mixed example; taken from Sunscreen ex.
Ott p.770. A corporation is interested in comparing two different sunscreens (s1 and s2). A random sample of 10 females (ages 20-25 years) participated in the study. For each person two 1" x 1" squares were marked off on either side of the back, under the shoulder but above the small of the back. Sunscreen s1 was randomly assigned to the two squares on one side of the back, with s2 on the other two squares. Exposure to the sun was for a two-hour period. The outcome was change (postexposure minus preexposure) in a reading based on the color of skin in a square. So we have 10 levels of the random column factor subjects, two levels of the fixed row factor, sunscreen, and two replications per cell. In file sunburn.dat we have the outcome measure in c1, the type of sunscreen (s1 = 1, s2 = 2) in c2, and the person (i.e. female tanning subject) in c3.

sunburn.lis
Minitab output for the mixed model analysis of the sunburn.dat data, a 2x10 design with 2 replications per cell.

unbalanc.dat
Data for a 2 x 3 fixed effects design, having between 1 and 3 replications per cell. The data are shown and described in Table 20.1 and section 20.2 of our NWK text. The first part of this data file has the outcome measure (growth rate in response to therapy) in c4, the row factor (subject gender 1,2) in c1, the column factor (degree of depressed development; severe = 1, moderate = 2, mild = 3) in c2, and the replication indicator in c3. This data structure is set up for the GLM approach to the analysis of unbalanced designs. The second part of the data file is set up for the application of the approximate analysis based on cell means; cell means in c1, row factor in c2, column factor in c3.

unbalanc.log
Analyses of the data in unbalanc.dat. First is shown the GLM analysis (cf. MTB version 7 manual p. 8-27). Second the approximate cell means analysis is constructed and then compared with GLM results.

stress.dat
Data are from a 2x2x2 fixed effects design with 3 replications per cell.
Data are shown in Table 22.2 and described in Section 22.2 of our NWK text. The outcome measure is exercise tolerance from a stress test in c1, with gender (male = 1, female = 2) in c2, body fat level (low = 1, high = 2) in c3 and smoking history (light = 1, heavy = 2) in c4.

stress.lis
Analysis of the 3-way design from stress.dat. Description using versions of MINITAB Table command along with Layout subcommand (cf. MTB version 7 manual pages 11-9,11-12). Three-way analysis of variance using anova command.

***Randomized Blocks***

bhhtab71.dat
Data from a 5 x 4 randomized block design with 5 levels of the blocking variable and 4 levels of the (fixed) treatment variable. One replication per cell. The data are from the Box, Hunter and Hunter text with the following description: "In this example a process for the manufacture of penicillin was being investigated, and the yield was the response of primary interest. There were 4 variants of the basic process to be studied. It was known that an important raw material, corn steep liquor, was quite variable. Fortunately blends sufficient for four runs could be made, thus supplying the opportunity to run all 4 treatments with each of the 5 blocks (blends of corn steep liquor). The experiment was protected from extraneous unknown sources of bias by running the treatments in random order within each block." The yield is in c1, block indicator in c2, and treatment indicator in c3.

bhhtab71.lis
Description and analysis of variance on randomized block design data in bhhtab71.dat.

dental.dat
Randomized block example, factorial treatment structure. From NWK prob. DENTAL PAIN. The "learning statistics is like pulling teeth" analogy is irresistible. An anesthesiologist made a comparative study of the effects of acupuncture and codeine on postoperative dental pain in male subjects.
The four treatments were (1) placebo treatment--a sugar capsule and two inactive acupuncture points; (2) codeine treatment only--a codeine capsule and two inactive acupuncture points; (3) acupuncture only--a sugar capsule and two active acupuncture points; (4) both codeine and acupuncture. These 4 conditions have a 2x2 factorial structure. Thirty-two subjects were grouped into 8 blocks of four according to an initial evaluation of their level of pain tolerance. The subjects in each block were then randomly assigned to the 4 treatments. Pain relief scores were obtained 2 hours after dental treatment. Data were collected on a double-blind basis. In file dental.dat c1 is pain relief score (higher means more pain relief), c2 is block, c3 is codeine, and c4 is acupuncture; for c3 and c4, 1 = no.

dental.lis
Minitab analysis for randomized block design of dental.dat.

***Nested Designs***

training.dat
NESTED DESIGN, training school example, from NWK Chap 28. Description p.970: A large manufacturing company operates 3 regional training schools for mechanics, one in each of its operating districts. The schools have two instructors each who teach classes of about 15 mechanics in 3-week sessions. The company was concerned about the effect of School (factor A) and instructor (factor B) on the learning achieved. To investigate these effects, classes in each district were formed in the usual way and then randomly assigned to one of the two instructors in the school [making class the "unit of analysis"]. This design was implemented for two 3-week sessions, and at the end of each session a suitable measure of learning for the class was obtained. Data are given in training.dat: C1 has class learning score, C2 is School (1,2,3), C3 is instructor (1,2), and C4 is class (first or second 3-week period).

training.lis
Data analysis using Minitab for training.dat, including the nested design anova.
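
In the training school example the instructor codes 1 and 2 refer to different people in each school, so instructor is nested within school and the total sum of squares splits as SSTO = SSA + SSB(A) + SSE. The Python sketch below walks through that arithmetic. It is only an illustration: the class scores are made-up placeholders (not the values in training.dat), all variable names are hypothetical, and the Minitab run in training.lis remains the reference analysis.

```python
# Nested (hierarchical) ANOVA sums of squares: instructor nested within school.
# PLACEHOLDER scores for illustration only; these are NOT the values in training.dat.
# scores[school][instructor] -> class learning scores, one per 3-week session.
scores = {
    1: {1: [20.0, 24.0], 2: [12.0, 10.0]},
    2: {1: [9.0, 7.0], 2: [21.0, 17.0]},
    3: {1: [16.0, 18.0], 2: [6.0, 4.0]},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

a = len(scores)   # number of schools (factor A)
b = 2             # instructors per school (factor B, nested in A)
n = 2             # classes (sessions) per instructor

all_y = [y for school in scores.values() for cls in school.values() for y in cls]
grand = mean(all_y)
school_mean = {s: mean(y for cls in school.values() for y in cls)
               for s, school in scores.items()}

# SSA: schools around the grand mean.
ss_school = b * n * sum((m - grand) ** 2 for m in school_mean.values())
# SSB(A): instructor means around their OWN school's mean (this is the nesting).
ss_instr = n * sum((mean(cls) - school_mean[s]) ** 2
                   for s, school in scores.items() for cls in school.values())
# SSE: classes around their instructor's mean.
ss_error = sum((y - mean(cls)) ** 2
               for school in scores.values() for cls in school.values() for y in cls)
ss_total = sum((y - grand) ** 2 for y in all_y)

df_school, df_instr, df_error = a - 1, a * (b - 1), a * b * (n - 1)

print("SSA     =", round(ss_school, 3), "df", df_school)
print("SSB(A)  =", round(ss_instr, 3), "df", df_instr)
print("SSE     =", round(ss_error, 3), "df", df_error)
print("SSTO    =", round(ss_total, 3), "df", a * b * n - 1)
```

Note that there is no school-by-instructor interaction term: because instructors are not crossed with schools, that variation is absorbed into SSB(A).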

NWK 28.9 Cross-nested design ("three factor partially nested design")
Data for decision making example
Minitab analysis for decision making example, NWK Fig 28.7

schoolcn.dat
Crossed and nested factors: teaching methods, schools, teachers, students. Can you work it out? This example is taken from a well-known educational statistics textbook: Hopkins & Glass. On the theme that you should rejoice that we use NWK instead, I found six (and counting) major errors in this text's exposition and solution for this single example. The example involves the comparison of 5 teaching methods. Two Schools (considered to be sampled at random) each employ these five teaching methods--i.e. each of the 5 teaching methods appears with each of the two schools--5x2 combinations. Within *each* of the two schools, 3 teachers are chosen at random, so we have three teachers chosen in School 1 and three different teachers chosen in School 2. Each teacher employs each of the 5 teaching methods, and the outcome data are mastery scores (mastery or not) for three students for each teacher-method combination within each school. NWK calls this a partially nested or crossed-nested design: section 28.9 pp. 1149-1154 (minitab p1153). In file schoolcn.dat the columns are outcome; method; school; teacher(within school); student replication. Note that the outcome measure is 0/1; this text goes on to assert "Balanced anova designs have been shown to yield accurate results even with dichotomous dependent variables [refs]..." For the present we will take them at their word. In file schoolcn.sol we answer the following questions:
a. Table the means for method crossed with school, and construct the corresponding profile plot. Do there appear to be main effects or interactions?
b. Obtain means for each teacher(within school); do there appear to be teacher effects?
c. Construct an appropriate anova model and obtain the corresponding anova table for this design.
d. Carry out the series of statistical tests for the terms (effects) identified in your model in part c; state your results, being careful to control the overall Type I error rate.

schoolcn.sol
Analyses for cross-nested schoolcn example.

***Repeated Measures***

drugrep.dat
Example from Winer Sec 4.3, Table 4.3-1. A study of the effects of 4 drugs upon reaction time to a series of standardized tasks was undertaken with 5 subjects, all of whom had been well-trained in these tasks. The 5 subjects are a random sample from a population of interest to the experimenter. Each subject was observed under each of the drugs; the order in which the drugs were administered was randomized. Time separation between doses was employed. The outcomes (C1 in drugrep.dat) were mean reaction time on the series of standardized tasks; in drugrep.dat C2 (1,2,3,4,5) is the person and C3 (1,2,3,4) is the drug. The drug data comprise a one-way repeated measures classification with 4 levels representing the reaction times associated with 4 types of drug.

drugrep.lis
Minitab analysis for repeated measures design for drugrep.dat.

bloodflow.dat
bloodflow example, NWK sec 29.3 (Two-Factor Experiments with Repeated Measures on Both Factors).

TABLE 29.7 Data for Blood Flow Example.

                    Treatment
Subject    A1B1    A1B2    A2B1    A2B2
   1         2      10       9      25
   2        -1       8       6      21
   3         0      11       8      24
  ...
  10        -2      10      10      28
  11         2       8      10      25
  12        -1       8       6      23

A clinician studied the effects of two drugs used either alone or together on the blood flow in human subjects. Twelve healthy middle-aged males participated in the study and they are viewed as a random sample from a relevant population of middle-aged males. The four treatments used in the study are defined as follows:
A1B1  placebo (neither drug)
A1B2  drug B alone
A2B1  drug A alone
A2B2  both drugs A and B
The 12 subjects received each of the four treatments in independently randomized orders. The response variable is the increase in blood flow from before to shortly after the administration of the treatment.
The treatments were administered on successive days. This prevented any carryover effects because the effect of each drug is short-lived. The experiment was conducted in a double-blind fashion so that neither the physician nor the subject knew which treatment was administered when the change in blood flow was measured. Table 29.7 and bloodflow.dat contain the data for this study. A negative entry denotes a decrease in blood flow. Figure 29.5 and bloodflow.lis contain the MINITAB output for the fit of repeated measures model (29.10). Included in the output are the expected mean squares for the specified ANOVA model. As explained in Chapter 28, each term in an expected mean square is represented in the MINITAB output by (1) the numeric code, in parentheses, for the variance of the model term, and (2) the preceding number, which is the numerical multiple. When the model term is fixed, the letter Q is used in the printout.

bloodflow.lis
Minitab analyses of bloodflow.

Brogan Kutner Example
Pre-post Repeated Measures

Brogan Kutner Analyses
Minitab and SAS repeated measures analyses

shoes.dat
"It's gotta be the shoes" Athletic Shoe sales example from NWK Chap 29.4. Between subjects factor, repeated measures on one factor: A national retail chain wanted to study the effects of two advertising campaigns (factor A) on the sales of athletic shoes over time (factor B). Ten similar test markets (subjects S) were randomly chosen to participate in the study (each campaign used in 5 of these markets). Sales data (c1 in shoes.dat) were collected for 3 two-week periods (two weeks prior to campaign, two weeks during, two weeks after; coded 1,2,3 in c3 in shoes.dat). In shoes.dat c2 indicates the ad campaign (1,2) and c4 indicates test market site (1,2,3,4,5).

shoes.lis
The Minitab analysis replicates NWK; 'sales' is the outcome measure, 'ad' is type of advertising campaign, 'time' is the repeated measures factor, and 'subj' is test market site.

***** PART II ANALYSIS OF ASSOCIATIONS: REGRESSION and CORRELATION *******

corr.dat
28 bivariate observations, test 1 in c1, test 2 in c2.

corr.out
Simple plotting, correlation, and straight-line regression analyses of corr.dat.

corrres.lis
Illustration of different types of residual scores using corr.dat data. See NWK text Chap 9 (esp Sec. 9.2).

predict.lis
Illustration of PREDICT subcommand (cf. MTB ver 7 manual 7-10,11) using corr.dat.

welfare.dat
Children's Welfare in California. Data collected by the Oakland-based "Children Now" from government resources over the past four years to comprise a "year-in-the-life" composite index of children's welfare. Data are presented on a county-by-county basis.
c1: County ranking on Welfare index
c2: Median family income
c3: Median family income ranking

welfare.lis
Illustrates descriptive univariate analyses (stem-and-leaf etc) and correlation and regression analyses and plots.

coleman.dat
Data from the Coleman report used to illustrate multiple regression. File coleman.dat contains data from a random sample of 20 schools (from the East) from the 1966 Coleman Report. The outcome measure C7 is the verbal mean test score for all sixth graders in the school. The predictor variables are: C2, staff salaries per pupil; C3, percent white collar fathers for the sixth graders; C4, a SES composite measure (deviation) for the sixth graders; C5, mean teacher's verbal test score; C6, 6th grade mean mother's educational level (1 unit = 2 school yrs).

bodyfat.dat
Data taken from NWK text, Table 8.1. Measurement data in which 3 relatively inexpensive methods of assessment are compared with the "gold standard" of accurate measurement. Description: "data for a study of the relation of the amount of body fat to several possible explanatory, independent variables, based on a sample of 20 healthy females 25-34 years old. The possible independent variables are triceps skinfold thickness, thigh circumference, and midarm circumference."
c1 has triceps, c2 has thigh, c3 has midarm, c4 has amount of body fat.

bodyfat.out
Illustrates multiple regression procedures in NWK text Sec. xx, and residual diagnostics.

marks.log
Uses marks.dat to illustrate properties of multiple regression (and partial correlation) coefficients and diagnostics for same via adjusted variables approach.

marksnew.log
Repeats, revises aspects of the marks.log analyses to match partial regression slopes and plots approach in NWK Section 11.1.

nels.dat
Contains a subset of observations and variables from the public release data tape for National Educational Longitudinal Study of 1988 (NELS:88). The National Center for Education Statistics collected data from a representative sample of 8th-graders across the U.S. and followed these students through grades 10 and 12. At each grade, students took several achievement tests and completed surveys that included questions about their academic, family, and social lives. The nels.dat data set contains students' 10th-grade scores on the science achievement test, along with several variables that are hypothesized to be good predictors of 10th-grade science achievement. Student ID is in C1 and 10th-grade science score is in C2. Four achievement variables from 8th grade are included: science, reading, math knowledge, and math reasoning (C3-C6). The math knowledge and math reasoning scores are standardized (they have mean zero, variance one). Indicator variables are included for advanced "track" (i.e., high school program) and general track; each student receives a 1 on the variable if he or she is in that program and a 0 otherwise. Students in the academic track receive 0's on both variables. These are found in C7 and C8, respectively. In C9-C12 there are indicator variables for courses taken - biology or not in C9, chemistry or not in C10, earth science or not in C11, and general science or not in C12. C13 contains an indicator variable for gender: 1 for males, 0 for females.
In C14-C16 are indicator variables for ethnicity: Asian or not in C14, African-American or not in C15, and Latino/Hispanic or not in C16. Finally, C17 and C18 contain indicator variables for socio-economic status: Lowest quartile or not in C17 and highest quartile or not in C18.

grow.dat
Data from the Berkeley Growth Study (Nancy Bayley). These data are for Child #8 in the BGS study with age in months in c2 (ranging from 1 to 60) and intellectual performance in C1.

grow.lis
Fitting a score on age regression for grow.dat, using polynomial regression.

SPRING QTR

dummy.log
Single classification anova via regression with dummy (group membership) predictor variables. Uses smsg.dat and harr.dat.

dum2way.dat
The response data in c1 are obtained from the following 2x3 design. An experiment was conducted to examine the effects of different levels of reinforcement and different levels of isolation on children's ability to recall. A single analyst was to work with a random sample of 36 children selected from a relatively homogeneous group of fourth-grade students. Two levels of reinforcement (none and verbal) and three levels of isolation (20, 40, and 60 minutes) were to be used. Students were randomly assigned to the six treatment groups, with a total of six students being assigned to each group. Each student was to spend a 30-minute session with the analyst. During this time the student was to memorize a specific passage, with reinforcement provided as dictated by the group to which the student was assigned. Following the 30-minute session, the student was isolated for the time specified for his or her group and then tested for recall of the memorized passage. These data appear in the accompanying table.

                            Time of Isolation (Minutes)
Level of Reinforcement        20          40          60
None                        26  19      30  36       6  10
                            23  18      25  28      11  14
                            28  25      27  24      17  19
Verbal                      15  16      24  26      31  38
                            24  22      29  27      29  34
                            25  21      23  21      35  30

Clearly, both factors are fixed factors.
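
As a quick check on how the table above is read (each printed row carries two responses per isolation level, giving six responses per cell), the following Python sketch computes the cell means and the balanced two-way sums of squares directly from the tabled values. The dictionary layout and variable names are hypothetical choices for illustration; this is not the Minitab analysis in dum2way.lis.

```python
# Responses transcribed from the reinforcement x isolation table.
# Keys are (reinforcement level, minutes of isolation); 6 children per cell.
cells = {
    ("none", 20): [26, 19, 23, 18, 28, 25],
    ("none", 40): [30, 36, 25, 28, 27, 24],
    ("none", 60): [6, 10, 11, 14, 17, 19],
    ("verbal", 20): [15, 16, 24, 22, 25, 21],
    ("verbal", 40): [24, 26, 29, 27, 23, 21],
    ("verbal", 60): [31, 38, 29, 34, 35, 30],
}
rows = ["none", "verbal"]
cols = [20, 40, 60]
n = 6  # replications per cell

def mean(values):
    values = list(values)
    return sum(values) / len(values)

all_y = [y for ys in cells.values() for y in ys]
grand = mean(all_y)
cell_mean = {key: mean(ys) for key, ys in cells.items()}
row_mean = {r: mean(y for c in cols for y in cells[(r, c)]) for r in rows}
col_mean = {c: mean(y for r in rows for y in cells[(r, c)]) for c in cols}

# Balanced fixed-effects decomposition: SSTO = SSA + SSB + SSAB + SSE.
ss_rows = n * len(cols) * sum((row_mean[r] - grand) ** 2 for r in rows)
ss_cols = n * len(rows) * sum((col_mean[c] - grand) ** 2 for c in cols)
ss_inter = n * sum((cell_mean[(r, c)] - row_mean[r] - col_mean[c] + grand) ** 2
                   for r in rows for c in cols)
ss_error = sum((y - cell_mean[key]) ** 2 for key, ys in cells.items() for y in ys)
ss_total = sum((y - grand) ** 2 for y in all_y)

for r in rows:
    print(r, [round(cell_mean[(r, c)], 2) for c in cols])
print("SS reinforcement, isolation, interaction, error:",
      [round(v, 2) for v in (ss_rows, ss_cols, ss_inter, ss_error)])
```

The cell means already suggest why the interaction matters here: recall falls sharply at 60 minutes with no reinforcement but rises at 60 minutes with verbal reinforcement.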
In this data file the responses above are in c1 with row (1,2) in c2 and column (1,2,3) in c3. In c10-c14 are the dummy (0,1) codings for the regression version of a two-way anova.

dum2way.lis
Constructs the dummy variables in dum2way.dat. Carries out regression and GLM analyses of the 2x3 fixed effects design.

ancova.log
Illustration of 2-group, pre-post analysis of covariance with data from smsg.dat. First the multiple regression approach is shown, followed by the MINITAB ancova routine for comparison.

ancvdrug.dat
Data taken from Ott's text to illustrate a 2-group, pre-post design. The description of these data is: "An investigator is interested in comparing two drug products (A and B) in overweight female volunteers. The experiment calls for 20 randomly selected subjects who are at least 25% overweight. Ten of these women are to be randomly assigned to product 1 and the remaining 10 to product 2. The response of interest is a score on a rating scale used to measure the mood of a subject. To obtain a score, a subject must complete a checklist indicating how each of 50 adjectives describes her mood at that time. On the study day, all 20 volunteers are required to complete the checklist at 8 AM. Then each subject is given the prescribed medication (product 1 or 2). Each subject is required to complete the checklist again at 10 AM." The 8 AM score is in c1, the 10 AM score in c2 and the group membership indicator (1 = product 1; 0 = product 2) in c3.

ancvdrug.lis
Description of 2-group pre-post data in ancvdrug.dat. Analysis of covariance is carried out with multiple regression, dummy-variable approach and then compared with MINITAB ancova command.

huitema.dat
Three groups, each of size 10, single outcome, 2 covariates. Taken from the Huitema text with the description: "The investigator is concerned with the effects of three different types of study objectives on student achievement in freshman biology.
The three types of objectives are:
1. General--students are told to know and understand everything in the text.
2. Specific--students are provided with a clear specification of the terms and concepts they are expected to master and of the testing format.
3. Specific with study time allocations--the amount of time that should be spent on each topic is provided in addition to specific objectives that describe the type of behavior expected on examinations.
The dependent variable is the biology achievement test. A population of freshman students scheduled to enroll in biology is defined, and 30 students are randomly selected. The investigator obtains aptitude test scores and scores from an academic motivation test for all students before the investigator randomly assigns 10 students to each of the three treatments. Treatments are administered, and scores on the dependent variable are obtained for all students." In the data file, the dependent variable is in c1, aptitude test in c2, academic motivation in c3, and group membership variable (1,2,3) in c6. In c4-c5 are two 0,1 dummy variables that define the group membership in c6.

huitema.lis
Description of data in huitema.dat. Carries out ancova for the 3-group two-covariate design using MINITAB ancova and multiple regression approach.

cnrl.dat
2-group data (10 cases per group) with single outcome and single covariate taken from Rogosa (1980). Outcome in c1, covariate in c2, group membership (1,0) in c3.

cnrl.lis
Description of cnrl.dat. Carries out computations needed for Comparing Nonparallel Regression Lines procedures.

nwkt12p1.dat
Data from NWK text, now Chapter 8, formerly Table 12.1. "A hospital surgical unit was interested in predicting survival in patients undergoing a particular type of liver operation. A random selection of 54 patients was available for analysis.
From each patient record, the following information was extracted from the preoperation evaluation: blood clotting score, prognostic index, enzyme function test, liver function test. The dependent variable is survival time." Blood clotting score is in c1, prognostic index in c2, enzyme function test in c3, liver function test in c4, survival time in c5 and log10survival in c6.

stepw.lis
Uses nwkt12p1.dat to illustrate stepwise regression variable selection procedures (forward stepwise, backward elimination). Reproduces results in NWK.

breg.lis
Uses nwkt12p1.dat to illustrate "best subsets" variable selection procedure (using breg in MINITAB). Reproduces results in NWK.

pcamarks.dat
Data from 18 students in the prior (many years ago) 2-quarter version of part of this course (i.e. Education 250A,B). c1-c6 are the scores on the six graded homework assignments; c7 has the final exam for 250A, c8 has the midterm in 250B, and c9 has the outcome score, the final exam in 250B.

pca257.lis
Uses composite construction and principal components (using MINITAB pca) to examine data reduction procedures for the predictors in pcamarks.dat.

Path Analysis
First path analysis example from lecture: 4 variables, SES IQ nAch GPA

Path Analysis
Second path analysis example from lecture: three longitudinal observations.

************** PART III ANALYSIS OF CATEGORICAL DATA ****************

Agresti Supplement
Tables from the Appendix of "An Introduction to Categorical Data Analysis," by Alan Agresti, published by John Wiley and Sons, Inc., January 1996. The tables show SAS code for the analyses conducted in that text, and contain the major data sets from that text.

Exact Confidence Interval for Proportion
SAS implementation in PROC FREQ for Exact Confidence Interval for Proportion, see Agresti Ch.1 (also Mathematica handout).

Generalized Linear Models: Logistic and Poisson Regression

coupon.dat
Dichotomous outcome, single quantitative predictor (*with replication*).
From NWK supplement (or the NWK regression book), the description is: "In a study of the effectiveness of coupons offering a price reduction on a given product, 1,000 homes were selected and a coupon and advertising material for the product were mailed to each. The coupons offered different price reductions (5,10,15,20, and 30 cents), and 200 homes were assigned at random to each of the price reduction categories. The independent variable in this study is the amount of price reduction, and the dependent variable is a binary variable indicating whether or not the coupon was redeemed within a six-month period." The price reduction is in c1, number of households (200) in c2, and number redeemed from the 200 households in c3.

coupon.lis
Logit transformation and OLS and WLS fits to coupon.dat.

program.dat
Dichotomous outcome, single quantitative predictor. From NWK supplement (or the NWK regression book), the description is: "A small-scale investigation was undertaken to study the effect of computer programming experience on ability to complete a complex programming task, including debugging, within a specified time. Twenty-five persons were selected for the study. They had varying amounts of programming experience (measured in months of experience). All persons were given the same programming task. The results are coded in binary fashion; if the task was completed successfully in the allotted time, it was scored 1, and if the task was not completed successfully, it was scored 0." Months of experience are in c1, and the binary outcome measure is in c2.

program.lis
Plots and description of program.dat. OLS and WLS fits of straight-line functional form. BMDPLR logistic regression fit (presented in class) compared with straight-line fit.
NEW! Minitab blog binary logistic regression.

progsas.sas
Contains the SAS instructions to carry out a logistic regression for the data in program.dat.

progsas.lst
SAS output obtained from the command line statement: "sas progsas" on an elaine. Contains the logistic regression parameter estimates and fits.

disease.dat
Data set, 98 cases, shown in NWK Table 14.3 and App.C.3. In disease.dat C1 is Age, C2 and C3 the SES indicators (see p.582), C4 City Sector, C5 disease status.

diseaseselect.lis
Comparison for variable selection of various logistic regression models for disease data following NWK sec 14.5.

Poisson Regression
Construction of Artificial Data using Minitab and SAS analysis using PROC GENMOD

Contingency Tables and Log-linear Models

draft.cnt
Draft lottery data from 1971. Rows are months Jan-Dec and columns are number of days with highest risk (numbers 1-122) in C1, numbers 123-244 in C2, and lowest risk (numbers 245-366) in C3.

draft.lis
Chi-square test for independence (fairness) for draft lottery data.

Aspirin and MI
Data and SAS analysis for Aspirin Use and Myocardial Infarction, Agresti Section 2.2.2

Lung Cancer
Data and SAS analysis for Smoking and Lung Cancer example

Tea Tasting
Data and SAS analysis. Fisher's Tea Tasting example; Fisher's Exact test, Agresti Section 2.6.1

Bayes Rule and Conditional Probability: At-risk Students example

Matched Pairs, McNemar's test
Data and SAS analysis using PROC FREQ of matched pairs data. Example is approval rating (approve/disapprove) data from 1600 individuals at two times. See Agresti Ch 9.

CMH analysis Ex
Data and SAS analysis. SAS file for CMH analysis (Cochran-Mantel-Haenszel Statistics) for meta analysis of Chinese smoking data in Agresti Table 3.3

CMH analysis for Migraine Ex
Data and SAS analysis. Cochran-Mantel-Haenszel Statistics for Migraine Ex. 2x2 factorial design--Gender by Treatment (Active, Placebo) with binary outcome Improve (Better, Same).

Belief in Afterlife Ex
Data and SAS GENMOD analysis, loglinear model for Agresti Ch. 2 2x2 Example.

Death Penalty Ex
Cross-classification Tables for Death Penalty Data. Illustration of Simpson's Paradox.
2x2x2 Table: Death Penalty dp (yes/no); Defendant Race defr; Victim Race victr (coding: white=1, black=2).

PROC GENMOD code for Migraine Ex
SAS run file for all partial and saturated log-linear models for Migraine Ex. 2x2 factorial design, Gender by Treatment (Active, Placebo), with binary outcome Improve (Better, Same).

PROC GENMOD output for Migraine Ex
Resulting SAS output for all partial and saturated log-linear models for Migraine Ex.

Selected models for Migraine Ex
SAS code and output for selected log-linear models for Migraine Ex. Subset of examples above.

--------------

Alcohol, Cigarette, and Marijuana Use Example
Agresti Table 6.3. Data are from a survey conducted in 1992 by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. Among other things, the survey asked students in their final year of high school in a nonurban area near Dayton, Ohio whether they had ever used alcohol, cigarettes, or marijuana. Denote the variables in this 2 x 2 x 2 table by A for alcohol use, C for cigarette use, and M for marijuana use.

Table 6.3 Alcohol (A), Cigarette (C), and Marijuana (M) Use for High School Seniors

Alcohol   Cigarette      Marijuana Use
Use       Use            Yes      No
---------------------------------------
Yes       Yes            911     538
          No              44     456
No        Yes              3      43
          No               2     279
---------------------------------------

PROC GENMOD code for A C M Ex
SAS run file for all partial and saturated log-linear models for A C M Ex.

PROC GENMOD output for A C M Ex
Resulting SAS output for all partial and saturated log-linear models for A C M Ex.

Drugs AC AM CM
SAS GENMOD analysis for best loglinear model (AM, AC, CM) shown in Agresti Table 6.7.

Drugs AM CM
SAS analysis for (poor-fitting) loglinear model (AM, CM) shown in Agresti Table 6.7.

Trend in 2xC tables
Agresti Section 2.5.2, Alcohol and Infant Malformation Example. Table 2.7 refers to a prospective study of maternal drinking and congenital malformations.
After the first three months of pregnancy, the women in the sample completed a questionnaire about alcohol consumption. Following childbirth, observations were recorded on presence or absence of congenital sex organ malformations.

Table 2.7 Infant Malformation and Mother's Alcohol Consumption

Alcohol           Malformation                 Percentage
Consumption     Absent    Present     Total     Present
--------------------------------------------------------
0               17,066       48      17,114      0.28
<1              14,464       38      14,502      0.26
1-2                788        5         793      0.63
3-5                126        1         127      0.79
>=6                 37        1          38      2.63
--------------------------------------------------------
Source: B. I. Graubard and E. L. Korn, Biometrics 43:471-476 (1987).

SAS output illustrates the Cochran-Armitage test for trend.

Linear Association Models for Ordinal Data
Agresti, Chapter 7. Data from the 1991 General Social Survey illustrate the inadequacy of ordinary loglinear models for analyzing ordinal data. Subjects were asked their opinion about a man and woman having sex relations before marriage, with possible responses “always wrong,” “almost always wrong,” “wrong only sometimes,” and “not wrong at all.” They were also asked if they “strongly disagree,” “disagree,” “agree,” or “strongly agree” that methods of birth control should be made available to teenagers between the ages of 14 and 16. Both classifications have ordered categories. The SAS analysis using PROC GENMOD compares the independence model and the linear association model.
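
For readers without SAS, the comparison of the loglinear models (AC, AM, CM) and (AM, CM) for the Table 6.3 alcohol, cigarette, and marijuana counts can be sketched in plain Python. This is a minimal illustration, not the course's PROC GENMOD code: it swaps in iterative proportional fitting (IPF) for the model fit, and the function names `margin`, `fit_ipf`, and `g_squared` are made up for this sketch. Only the eight cell counts come from the table.

```python
import math

# Observed counts from Table 6.3, keyed by (alcohol, cigarette, marijuana).
# Index 0 means "yes" and index 1 means "no" for each variable.
observed = {
    (0, 0, 0): 911, (0, 0, 1): 538,
    (0, 1, 0): 44,  (0, 1, 1): 456,
    (1, 0, 0): 3,   (1, 0, 1): 43,
    (1, 1, 0): 2,   (1, 1, 1): 279,
}

def margin(table, axes):
    """Sum the table over every axis not listed in `axes`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def fit_ipf(table, margins, tol=1e-8, max_iter=1000):
    """Fit a hierarchical loglinear model by iterative proportional fitting.

    `margins` lists the axis tuples whose observed margins the fitted
    table must reproduce, e.g. [(0, 1), (0, 2), (1, 2)] for (AC, AM, CM).
    """
    fitted = {cell: 1.0 for cell in table}
    targets = {axes: margin(table, axes) for axes in margins}
    for _ in range(max_iter):
        max_change = 0.0
        for axes in margins:
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                new = fitted[cell] * targets[axes][key] / current[key]
                max_change = max(max_change, abs(new - fitted[cell]))
                fitted[cell] = new
        if max_change < tol:
            break
    return fitted

def g_squared(table, fitted):
    """Likelihood-ratio (deviance) statistic G^2 = 2 * sum n * log(n / m)."""
    return 2.0 * sum(n * math.log(n / fitted[c]) for c, n in table.items() if n > 0)

# Axes: 0 = alcohol (A), 1 = cigarette (C), 2 = marijuana (M).
fit_homog = fit_ipf(observed, [(0, 1), (0, 2), (1, 2)])  # model (AC, AM, CM)
fit_am_cm = fit_ipf(observed, [(0, 2), (1, 2)])          # model (AM, CM)

g2_homog = g_squared(observed, fit_homog)
g2_am_cm = g_squared(observed, fit_am_cm)
print(f"G^2 for (AC, AM, CM): {g2_homog:.2f}")
print(f"G^2 for (AM, CM):     {g2_am_cm:.2f}")
```

A small G^2 for (AC, AM, CM) next to a very large G^2 for (AM, CM) is what the labels "best" and "poor-fitting" in the entries above refer to. IPF is used here only because it needs no external libraries; it should give the same fitted counts as a Poisson loglinear fit of the same model.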
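
The Cochran-Armitage trend test mentioned for Table 2.7 can also be sketched without SAS. The counts below are taken from the table; the row scores (0, 0.5, 1.5, 4, 7) are an assumption of this sketch (category midpoints, with an arbitrary 7 for the open-ended top category), and a different choice of scores would change the result. The function name `cochran_armitage` is likewise illustrative.

```python
import math

# Rows of Table 2.7 as (score, malformation absent, malformation present).
# The scores are ASSUMED: midpoints of the drinking categories, with 7
# chosen arbitrarily for the open-ended ">=6" category.
rows = [
    (0.0, 17066, 48),
    (0.5, 14464, 38),
    (1.5, 788, 5),
    (4.0, 126, 1),
    (7.0, 37, 1),
]

def cochran_armitage(rows):
    """Return (z, two-sided p-value) for a linear trend in proportions."""
    n_total = sum(absent + present for _, absent, present in rows)
    present_total = sum(present for _, _, present in rows)
    p_bar = present_total / n_total  # pooled proportion with malformation

    sum_x = sum(x * (absent + present) for x, absent, present in rows)
    sum_x2 = sum(x * x * (absent + present) for x, absent, present in rows)

    # Score-weighted excess of observed over expected "present" counts.
    t_stat = sum(x * present for x, _, present in rows) - p_bar * sum_x
    # Variance of t_stat under the null hypothesis of no trend.
    variance = p_bar * (1.0 - p_bar) * (sum_x2 - sum_x ** 2 / n_total)

    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value

z, p_value = cochran_armitage(rows)
print(f"z = {z:.3f}, z^2 = {z * z:.2f}, two-sided p = {p_value:.4f}")
```

The statistic z^2 is referred to a chi-square distribution with 1 degree of freedom. Because the two highest consumption categories each contain a single malformation case, the result is sensitive to the scores chosen, so it is worth rerunning with alternative scores (for example, equally spaced ranks).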