LOGISTIC REGRESSION (aka VARBRUL) in SPSS.

This tutorial continues where the January 5 tutorial left off.

PREAMBLE: today we look at actually building a quantitative model for final /-s/ deletion in Panama Spanish. Model building can have a number of purposes:

o in order to test a hypothesis about the relationship between two variables in your data, you may need to build a model as a way of controlling for interactions between variables. Even so-called "simple" hypothesis testing implicitly involves building a (simple) model of your data.

o a model can tell you how much of the variability in your data you are able to account for using the information you have encoded in your dataset. As such, it can tell you how well you are able to explain the phenomenon.

o a model can serve as a compact, extensible representation of your data that you apply for some other purpose. A probabilistic model of verbal subcategorization frames, for example, could be useful to a psycholinguist who is interested in how knowledge of subcategorization trends affects reading times.

Our models will be of the form

P( Y=y | X=x )

where Y is the DEPENDENT variable and X is some set of INDEPENDENT variables. For now, we'll treat logistic regression as a black box, and focus on what you do in general with models of the form outlined above.

7) Aggregated, or grouped, views of the data.

There are 8,846 examples in our dataset, but many of them are indistinguishable: given the number of values that each variable takes, there are only 2*3*5*4 = 120 possible combinations of variable values. It can be more useful to look at the data in aggregate form -- that is, with only one row per unique combination of variable values.

When we aggregate our data, we have a choice: we can either combine cases that differ only in Deletion, or treat them separately. In the former case, we probably want to include in the aggregation the frequency of deletion for each row.
The latter case is a bit simpler in SPSS, though, so we'll do it first:

Select Data | Aggregate. Move Deletion, POS, Environment, and class all into the "Break Variables" block. Check "Number of cases", and name it something -- I'll use "N". Finally, under "Save" select "Create new data file" and name it something appropriate -- I'll use "cedergren-aggregated.sav". Click OK. Then go back to the main menu, choose File | Open, and select the file that you just created. In the new file, each row has a unique combination of variable values, and the "N" column gives the number of cases with that combination.

To aggregate data differing in the value of Deletion, go back to the original data file and choose Data | Aggregate again. Move POS, Environment, and class into the "Break Variables" block. Then move Deletion into the "Aggregated Variables" box. By default it will say in that box:

Deletion_mean = MEAN(Deletion)

This means that a new variable called Deletion_mean will be created, and its value will be the average value of Deletion over all the cases with a given combination of the break variables. As before, check "Number of cases" and name the new variable, and choose to create a new file. Click OK, and open the new file. Note that there are only about half as many rows now, and that each row has a Deletion_mean variable valued between 0 and 1.

8) Logistic Regression (aka VARBRUL).

Select Analyze | Regression | Binary Logistic. This gives you the interface for setting up logistic regressions. Recall that we're interested in looking at the effects of grammatical category (POS), following environment, and social class on the likelihood of /-s/ deletion. We'll start by building a simple model of deletion where POS, environment, and class have independent effects.

Since deletion is the dependent variable in our dataset, move "Deletion" into the "Dependent" box and move "POS", "Environment", and "class" into the "Covariates" box.
Also, even though we've coded social class with numerical values, we're actually going to treat it as a _categorical_ variable -- that is, classes 1, 2, 3, 4 are not treated as points on any kind of scale. Do this by clicking on "Categorical" at the bottom of the window. This will open up a "Define Categorical Variables" window; move "class" into the right-hand "Categorical Covariates" box in this window. You can leave "Change Contrast" at "Indicator". Then click Continue, and back in the Logistic Regression interface window click OK.

A series of tables will then be printed in the Output window. There will be three major sections of output, looking like this (see the outline on the left side of the Output window):

Logistic Regression
  Case Processing Summary ...
  Dependent variable encoding ...
  Categorical Variables Codings ...
Block 0: Beginning Block
  Classification Table ...
  Variables in the Equation ...
  Variables not in the Equation ...
Block 1: Method = Enter
  Omnibus Tests of Model Coefficients ...
  Model Summary ...
  Classification Table ...
  Variables in the Equation ...

At the moment, we're not interested in the specifics of how logistic regression works, which you would need in order to understand parts of the "Variables (not) in the Equation" tables. Instead, we want to focus on how well a model fits a dataset, and on comparing the fit of two models. The information relevant for this is in three tables:

I) Model Summary: this table in Block 1 contains three pieces of information. The most important one is

-2 Log likelihood

which is calculated from the likelihood of the data under the model. Lower values of this statistic indicate models under which the data are more likely, since it is a negative log (think this through if it's not self-evident). All else being equal, we want to use models that have high data likelihood.

The Cox & Snell R square and Nagelkerke R square are closely related statistics, and basically summarize how much of the variability in your data is successfully explained by your model.
Larger values of these R squares (the Nagelkerke has a maximum value of 1) indicate that your model captures more of the data variability. We'll look more at this idea later in the course, in the context of linear regression.

Block 0 should really have a Model Summary table that shows the data likelihood, but SPSS for some reason doesn't put it there. If you think about it, there is a way of getting around this!

II) Classification Table: this table shows the percentage of individual cases correctly classified by a simple "majority rule": for each case, the model guesses the outcome (deletion or no deletion) that it considers most likely.

III) Omnibus Tests of Model Coefficients: this table shows a chi-square statistic comparing the model with a simpler model (in this case, the model with only an "intercept"). Larger values of the chi-square statistic indicate a bigger difference in fit between this model and the simpler one. The "Sig." column shows the statistical significance of this difference (again, we'll discuss this idea more thoroughly later on). In our case, all three rows are the same, but you can also have SPSS build a model incrementally, in which case the rows can differ.

9) Visualizing data: scatterplots.

Displaying the data in some graphical form is often useful for understanding patterns in it. Right now we will visualize the fit of the logistic regression model by plotting predicted frequency of deletion against actual frequency of deletion. To do this, we're going to make use of the data aggregation function again.

Go back into the binary logistic regression interface, click "Save", check Predicted Values: Probabilities, and click "Continue". Then click "OK" from the main logistic regression interface.
The output will be the same as before, but in the Data Editor window there will be a new variable PRE_1 whose value is the predicted probability

P( Deletion = 1 | POS, Environment, class )

To usefully plot predicted versus actual frequency, we need to aggregate the data as before. Select Data | Aggregate, move POS, Environment, and class into "Break Variables" as before, and move both Deletion and the new Predicted variable (PRE_1) into "Aggregated Variables." (The MEAN of the Predicted variable is fine because this will always be the same for a given combination of break variables -- why?) Save the aggregated data to a file and open that file.

The Data Editor window now has both Predicted_mean and Deletion_mean variables. We'll plot these against each other. Go to Graphs | Scatter/Dot, highlight "Simple Scatter", and click "Define." Move Predicted_mean to the "X axis" box and Deletion_mean to the "Y axis" box. When you click OK, a graph will appear in the Output window, with Predicted_mean on the X axis and Deletion_mean on the Y axis.

Now double click on the chart, and you'll enter the Chart Editor. You can do all sorts of useful things from this window, but we'll focus on two.

First, select Options | Reference Line from Equation. In the ideal model, the predicted deletion frequency always exactly equals the actual deletion frequency, so we can plot this ideal model by selecting a Y-intercept of 0 and a slope of 1. Click "Apply" and a line representing this ideal fit will appear. Points close to this line represent combinations of independent variables (often called COVARIATE VECTORS) where the model closely matches the data. Points farther away have a poorer fit. Points extremely far away may be OUTLIERS -- that is, cases where there is a fundamental mismatch between the assumptions inherent in the model and the actual dynamics in the data.

Second, you can examine individual cases by left-clicking on points, then right-clicking on them and selecting "Go to case".
This can be very useful.
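
If you want to double-check the scatterplot outside of SPSS, the same logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows below are invented (they are not from the Cedergren data), the value labels such as "noun" and "pause" are hypothetical, and the function names aggregate and worst_fit are made up for this sketch. It groups cases by covariate vector, computes N, Deletion_mean and Predicted_mean for each group, and then ranks the groups by their vertical distance from the ideal line with Y-intercept 0 and slope 1.

```python
from collections import defaultdict

# Hypothetical case-level rows:
# (POS, Environment, class, Deletion, predicted probability).
# All values are invented for illustration only.
cases = [
    ("noun", "consonant", 1, 1, 0.70),
    ("noun", "consonant", 1, 1, 0.70),
    ("noun", "consonant", 1, 0, 0.70),
    ("noun", "vowel", 2, 0, 0.40),
    ("noun", "vowel", 2, 1, 0.40),
    ("verb", "pause", 3, 0, 0.90),
    ("verb", "pause", 3, 0, 0.90),
]


def aggregate(rows):
    """Group rows by covariate vector (POS, Environment, class).

    Returns a dict mapping each covariate vector to its number of
    cases, mean Deletion and mean predicted probability.
    """
    groups = defaultdict(list)
    for pos, env, cls, deletion, pred in rows:
        groups[(pos, env, cls)].append((deletion, pred))

    table = {}
    for key, members in groups.items():
        n = len(members)
        table[key] = {
            "N": n,
            "Deletion_mean": sum(d for d, _ in members) / n,
            "Predicted_mean": sum(p for _, p in members) / n,
        }
    return table


def worst_fit(table):
    """Covariate vectors sorted by distance from the line y = x, largest first."""
    def gap(key):
        row = table[key]
        return abs(row["Deletion_mean"] - row["Predicted_mean"])
    return sorted(table, key=gap, reverse=True)


agg = aggregate(cases)
for key in worst_fit(agg):
    row = agg[key]
    diff = row["Deletion_mean"] - row["Predicted_mean"]
    print(key, row["N"], round(row["Deletion_mean"], 3),
          round(row["Predicted_mean"], 3), round(diff, 3))
```

The covariate vectors printed first are the ones that would sit farthest from the reference line in the Chart Editor, so they are the natural candidates to inspect with "Go to case". Keep the N column in view when reading the list: a group with very few cases can land far from the line by chance alone.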