LOGISTIC REGRESSION (aka VARBRUL) in SPSS.

This tutorial continues where the January 5 tutorial left off.

PREAMBLE: 

Today we look at actually building a quantitative model for
final /-s/ deletion in Panama Spanish.  Model building can have a
number of purposes:

 o in order to test a hypothesis about the relationship between two
   variables in your data, you may need to build a model as a way of
   controlling for interactions between variables.  Even so-called
   "simple" hypothesis testing implicitly involves building a (simple)
   model of your data.

 o a model can tell you how much of the variability in your data you
   are able to account for using information you have encoded in your
   dataset.  As such, it can tell you how well you are able to explain
   the phenomenon.

 o a model can serve as a compact, extensible representation of your
   data that you apply for some other purpose.  A probabilistic model
   of verbal subcategorization frames, for example, could be useful to
   a psycholinguist who is interested in how knowledge of
   subcategorization trends affects reading times.

Our models will be of the form

   P( Y=y | X=x )

where Y is the DEPENDENT variable and X is some set of INDEPENDENT
variables.  For now, we'll treat logistic regression as a black box,
and focus on what you do in general with models of the form outlined
above.

7) Aggregated, or grouped, views of the data.  There are 8,846
   examples in our dataset, but many of them are indistinguishable,
   and given the number of values that each variable takes, there are
   only 2*3*5*4 = 120 possible combinations of variable values.  It
   can be more useful to look at the data in aggregate form -- that
   is, only one row per unique combination of variables.  
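
   The count of possible combinations is just the product of the
   number of values each variable takes.  Here is a quick check in
   Python.  Note one assumption: the text only gives the product, so
   the assignment of 3 and 5 to POS and Environment below is a guess
   made for illustration.

```python
import math

# Number of values per variable. Deletion is binary and class has four
# values; which of POS / Environment has 3 and which has 5 is assumed.
levels = {"Deletion": 2, "POS": 3, "Environment": 5, "class": 4}

# Every possible combination of variable values.
all_combos = math.prod(levels.values())

# Combinations that remain if Deletion is not used to distinguish rows.
without_deletion = all_combos // levels["Deletion"]

print(all_combos, without_deletion)
```

   Leaving Deletion out halves the number of possible combinations,
   whichever way the 3 and the 5 are assigned.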

   When we aggregate our data, we have a choice: we can either combine
   cases that differ only in Deletion, or treat them separately.  In
   the former case, we probably want to include in the aggregation the
   frequency of deletion for each row.  The latter case is a bit
   simpler in SPSS, though, so we'll do it first:

   Select Data | Aggregate. Move Deletion, POS, Environment, and class
   all into the "Break Variables" block.  Check "Number of cases", and
   name it something -- I'll use "N".  Finally, under "Save" select
   "Create new data file" and name it something appropriate -- I'll
   use "cedergren-aggregated.sav".  Click OK.  Then go back to the
   main menu, choose File | Open, and select the file that you just
   created. In the new file, each row has a unique combination of
   variable values, and also has the number of cases under the "N"
   column.
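
   For readers who want to see what this kind of aggregation does
   outside SPSS, here is a minimal Python sketch.  The toy cases and
   their category labels are invented for illustration; they are not
   taken from the actual data file, and the function name is
   hypothetical.

```python
from collections import Counter

# Hypothetical toy cases: (Deletion, POS, Environment, class).
cases = [
    (1, "noun", "pause", 1),
    (1, "noun", "pause", 1),
    (0, "noun", "pause", 1),
    (0, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
]

def aggregate_counts(rows):
    """Return one (combination, N) pair per unique combination of values.

    Deletion is part of the combination here, so cases that differ
    only in Deletion end up in separate rows.
    """
    counts = Counter(rows)
    return sorted(counts.items())

aggregated = aggregate_counts(cases)
for combo, n in aggregated:
    print(combo, "N =", n)
```

   The N values add up to the number of original cases, which is a
   handy sanity check on any aggregation.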

   To aggregate data differing in value of Deletion, go back to the
   original data file and choose Data | Aggregate again.  Move POS,
   Environment, and class into the "Break Variables" block.  Then move
   Deletion into the "Aggregated Variables" box.  By default it will
   say in that box:

     Deletion_mean = MEAN(Deletion)

   This means that a new variable called Deletion_mean will be
   created, and its values will be the average value of Deletion for
   all the cases of a given combination of the break variables.  As
   before, check "Number of cases" and name the new variable, and
   choose to create a new file.  Click OK, and open the new file.
   Note that there are only about half as many rows now, and that each
   row has a Deletion_mean variable valued between 0 and 1.
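
   The same idea can be sketched in Python: group the cases by the
   break variables only, then record the mean of Deletion and the
   number of cases in each group.  As before, the toy cases and the
   function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy cases: (Deletion, POS, Environment, class).
cases = [
    (1, "noun", "pause", 1),
    (1, "noun", "pause", 1),
    (0, "noun", "pause", 1),
    (0, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
    (1, "verb", "vowel", 3),
]

def aggregate_mean(rows):
    """Group by (POS, Environment, class).

    Returns {combination: (Deletion_mean, N)}.  Because Deletion is
    coded 0/1, its mean is the observed frequency of deletion.
    """
    groups = defaultdict(list)
    for deletion, pos, env, cls in rows:
        groups[(pos, env, cls)].append(deletion)
    return {combo: (sum(vals) / len(vals), len(vals))
            for combo, vals in groups.items()}

summary = aggregate_mean(cases)
for combo, (mean, n) in sorted(summary.items()):
    print(combo, f"Deletion_mean = {mean:.3f}", "N =", n)
```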

8) Logistic Regression (aka VARBRUL).  Select Analyze | Regression |
   Binary Logistic.  This gives you the interface for setting up
   logistic regressions.  Recall that we're interested in looking at
   the effects of grammatical category (POS), following environment,
   and social class on likelihood of /-s/ deletion.  We'll start with
   building a simple model of deletion where POS, environment, and
   class have independent effects.  Since deletion is the dependent
   variable in our dataset, move "Deletion" into the "Dependent" box
   and move "POS", "Environment", and "class" into the "Covariates"
   box.  Also, even though we've coded social class with numerical
   values, we're actually going to treat it as a _categorical_ variable
   -- that is, classes 1, 2, 3, and 4 are not on any kind of scale.  Do
   this by clicking on "Categorical" at the bottom of the window.  This
   will open up a "Define Categorical Variables" window; move "class"
   into the right-hand "Categorical Covariates" box in this window.  You can
   leave "Change Contrast" at "Indicator".  Then click Continue, and
   back in the Logistic Regression interface window click OK.  
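
   As a rough illustration of what indicator coding does with a
   categorical variable like class, here is a minimal Python sketch.
   It assumes that the last category serves as the reference category
   (as far as I know this is the SPSS default, but check your
   "Categorical Variables Codings" table); the function name is made
   up for this example.

```python
def indicator_code(value, categories):
    """Return 0/1 indicators for every category except the last.

    The last category is the (assumed) reference category, so it is
    represented by all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[:-1]]

classes = [1, 2, 3, 4]
for c in classes:
    print("class", c, "->", indicator_code(c, classes))
```

   With four classes this gives three indicator variables, so the
   model estimates three separate coefficients for class rather than
   a single slope.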

   A series of tables will then be printed in the Output window.
   There will be three major sections of output, looking like this
   (see the outline on the left side of the Output window):


     Logistic Regression
  
       Case Processing Summary
       ...

       Dependent Variable Encoding
       ...

       Categorical Variables Codings
       ...

     Block 0: Beginning Block

       Classification Table
       ...

       Variables in the Equation
       ...

       Variables not in the Equation
       ...

     Block 1: Method = Enter

       Omnibus Tests of Model Coefficients
       ...

       Model Summary
       ...

       Classification Table
       ...

       Variables in the Equation
       ...

   
   At the moment, we're not interested in the specifics of how
   logistic regression works, although you need them to understand parts
   of the "Variables (not) in the Equation" tables.  Instead, we want to
   focus on looking at how well a model fits a dataset, and comparing
   the fit of two models. The information relevant for this is in
   three tables:

     I) Model Summary: this table in Block 1 contains three pieces of
     information.  The most important one is 

       -2 Log likelihood

     which is calculated from the likelihood of the data under the
     model.  Lower values of this statistic indicate a model under which
     the data are more likely, since it is a negative log (think this
     through if it's not self-evident).  All else being equal, we want
     to use models that have high data likelihood.

     The Cox & Snell R square and Nagelkerke R square are closely
     related statistics, and basically summarize how much of the
     variability in your data is successfully explained away by your
     model.  Larger values of these R squares (the Nagelkerke has a
     maximum value of 1) indicate that your model captures more of the
     data variability.  We'll look more at this idea later in the
     course, in the context of linear regression.

     Block 0 should really have a Model Summary table that shows the
     data likelihood, but SPSS for some reason doesn't put it there.
     If you think about it, there is a way of getting around this!

     II) Classification Table: this table shows the percentage of
     individual cases correctly classified by a simple "majority rule":
     for each case, the model guesses whichever outcome (deletion or no
     deletion) it considers most likely.

     III) Omnibus Tests of Model Coefficients: this table shows a
     chi-square statistic comparing the model with a simpler model (in
     this case, the model with only an "intercept").  Larger values of
     the chi-square statistic indicate a bigger difference in fit
     between this and the simpler model.  The "Sig." column shows the
     statistical significance of this difference (again, we'll discuss
     this idea more thoroughly later on).  In our case, all three rows
     are the same, but you can also have SPSS build a model
     incrementally, in which case the rows can differ.
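
   To make the quantities in these three tables concrete, the
   following Python sketch computes them by hand for a handful of
   cases.  Everything here is an assumption made for illustration:
   the outcomes and predicted probabilities are invented rather than
   fitted by SPSS, the function names are hypothetical, the
   intercept-only model is taken to predict the overall deletion rate
   for every case, and ties at exactly 0.5 are classified as deletion.

```python
import math

def neg2_log_likelihood(outcomes, probs):
    """Return -2 * log likelihood of 0/1 outcomes given P(outcome = 1).

    Probabilities must lie strictly between 0 and 1.
    """
    log_lik = sum(math.log(p if y == 1 else 1.0 - p)
                  for y, p in zip(outcomes, probs))
    return -2.0 * log_lik

def fit_summary(outcomes, probs):
    """Compute the fit statistics discussed above for one model."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    # Intercept-only model: the same predicted probability for every case.
    neg2ll_null = neg2_log_likelihood(outcomes, [base_rate] * n)
    neg2ll_model = neg2_log_likelihood(outcomes, probs)
    # Chi-square: drop in -2 log likelihood relative to the simpler model.
    chi_square = neg2ll_null - neg2ll_model
    cox_snell = 1.0 - math.exp(-chi_square / n)
    # Nagelkerke rescales Cox & Snell so that its maximum value is 1.
    nagelkerke = cox_snell / (1.0 - math.exp(-neg2ll_null / n))
    # Classification: guess deletion whenever P(deletion) >= 0.5.
    n_correct = sum((p >= 0.5) == (y == 1)
                    for y, p in zip(outcomes, probs))
    return {
        "neg2ll_null": neg2ll_null,
        "neg2ll_model": neg2ll_model,
        "chi_square": chi_square,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
        "percent_correct": 100.0 * n_correct / n,
    }

# Invented outcomes (1 = deletion) and invented predicted probabilities.
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]
probs = [0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.9, 0.5]

stats = fit_summary(outcomes, probs)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

   Since the probabilities above are made up rather than estimated,
   the numbers only show how the pieces relate to one another; they
   should not be read as the output of a real regression.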

9) Visualizing data: scatterplots.  Displaying the data in some
   graphical form is often useful for understanding patterns in it.
   Right now we will visualize the fit of the logistic regression
   model by plotting predicted frequency of deletion against actual
   frequency of deletion.  To do this, we're going to make use of the
   data aggregation function again.  Go back into the binary logistic
   regression interface, click "Save", check Predicted Values:
   Probabilities, and click "Continue". Then click "OK" from the main
   logistic regression interface.  The output will be the same as
   before, but in the Data Editor window there will be a new variable
   PRE_1 whose value is the predicted probability

      P( Deletion = 1 | POS, Environment, class)

   To usefully plot predicted versus actual frequency, we need to
   aggregate the data as before.  Select Data | Aggregate, move POS,
   Environment, and class into "Break Variables" as before, and move
   both Deletion and the new predicted-probability variable (PRE_1)
   into "Aggregated Variables."  (The MEAN of PRE_1 is fine because
   this will always be the same for a given combination of break
   variables -- why?)  Save the aggregated data to file and open that
   file.

   The Data Editor window now has both Predicted_mean and
   Deletion_mean variables.  We'll plot these against each other.  Go
   to Graphs | Scatter/Dot, highlight "Simple Scatter", and click
   "Define."  Move Predicted_mean to the "X axis" box and
   Deletion_mean to the "Y axis" box.  When you click OK, a graph will
   appear in the Output window, with Predicted_mean on the X axis and
   Deletion_mean on the Y axis.

   Now double click on the chart, and you'll enter the Chart Editor.
   You can do all sorts of useful things from this window, but we'll
   focus on two.  First, select Options | Reference Line from
   Equation.  In the ideal model, the predicted deletion frequency
   always exactly equals the actual deletion frequency, so we can plot
   this ideal model by selecting a Y-intercept of 0 and a slope of 1.
   Click "Apply" and a line representing this ideal fit will
   appear.  Points close to this line represent combinations of
   independent variables (often called COVARIATE VECTORS) where the
   model closely matches the data.  Points farther away have a poorer
   fit.  Points extremely far away may be OUTLIERS -- that is, cases
   where there is a fundamental mismatch between the assumptions
   inherent in the model and the actual dynamics in the data.
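
   The distance of each point from the ideal line can also be
   computed directly.  The Python sketch below ranks covariate
   vectors by the vertical gap between observed and predicted
   deletion frequency.  The aggregated rows are invented for
   illustration, and the function name is hypothetical.

```python
# Hypothetical aggregated rows:
# (POS, Environment, class, Predicted_mean, Deletion_mean, N)
rows = [
    ("noun", "pause", 1, 0.62, 0.60, 210),
    ("noun", "vowel", 2, 0.35, 0.38, 145),
    ("verb", "consonant", 4, 0.80, 0.45, 12),
    ("verb", "pause", 3, 0.50, 0.52, 98),
]

def rank_by_gap(rows):
    """Sort covariate vectors by |observed - predicted|, largest first.

    On the scatterplot this gap is the vertical distance from the
    point to the reference line with intercept 0 and slope 1.
    """
    scored = [(abs(obs - pred), (pos, env, cls), pred, obs, n)
              for pos, env, cls, pred, obs, n in rows]
    return sorted(scored, reverse=True)

ranked = rank_by_gap(rows)
for gap, combo, pred, obs, n in ranked:
    print(f"{combo}: predicted={pred:.2f} observed={obs:.2f} "
          f"gap={gap:.2f} N={n}")
```

   Keeping N in the listing matters: a covariate vector with only a
   few cases can land far from the line by chance alone, so a large
   gap there is weaker evidence of a real mismatch than the same gap
   with hundreds of cases.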

   You can also examine individual cases by left-clicking on points,
   then right-clicking on them and selecting "Go to case".  This is
   useful for finding out which combination of variable values lies
   behind a poorly fitted point.