Methodology

Methodology behind the Predictions

We provide a simple explanation of what we do, how we arrive at our results, and how our analysis relates to what is reported in the news.

In the News

“Dead Heat”, “Neck-and-Neck”, “Solidly Republican”, “In Kerry’s Column”, “Well Within the Margin of Error”: these are phrases we often hear or read in the news. Such qualitative characterizations are based on the theory of random sampling. Polling tries to estimate the proportion of the population that has a certain attribute (e.g., favoring Nader).

The following simple model illustrates the idea of polling.

Imagine a box with N balls (with N being very large, e.g., all registered voters in a state), B of them painted blue and R red. We want to guess what the proportions are without having to count them all, by examining only n of them, drawn at random (after shaking and mixing well). The first ball we examine has a probability R/N of being red and B/N of being blue. If we were to put the drawn ball back into the box after each draw, these probabilities would remain unchanged for the second ball. Although polling is carried out without replacement (we do not poll the same person twice), if we poll far fewer balls than the total population size, it is safe to assume that these probabilities remain the same in all draws. The challenge behind polling is to make the best guess of the population proportions using the tabulated result of the poll (how many red and how many blue balls we have observed in a sample of size n).
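The box model is easy to simulate. The following is a minimal sketch, not part of the analysis itself: the function name `poll_box`, the box contents (52,000 red and 48,000 blue) and the sample size are all hypothetical values chosen for illustration.

```python
import random

def poll_box(num_red, num_blue, n, seed=None):
    """Draw n balls without replacement and return the observed share of red."""
    rng = random.Random(seed)
    box = ["red"] * num_red + ["blue"] * num_blue
    sample = rng.sample(box, n)  # sampling without replacement, like a real poll
    return sample.count("red") / n

# Hypothetical box: 52% red, 48% blue; we examine only 1,000 of 100,000 balls.
estimate = poll_box(52_000, 48_000, 1_000, seed=1)
print(f"observed share of red: {estimate:.3f} (true share: 0.520)")
```

Running it with different seeds shows the observed share scattering around the true 52%, which is exactly the sampling error discussed next.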

Since we do not examine all N of the balls, our random sample is unlikely to represent the actual population proportion exactly: there is an error. We often hear comments such as “48% with 3% margin of error”. This statement has a rigorous mathematical interpretation if we know the size of the poll. For example, if the poll was conducted on 1,068 voters, this statement means that there is a 95% chance that the real population proportion is inside the interval (48%-3%, 48%+3%) = (45%, 51%). If the poll size is a mere 752, there is only a 90% chance that the real population proportion is inside this interval. The value 95% or 90% is referred to as the confidence probability of the poll for a given margin of error. This confidence probability reflects the fact that the margin of error is not an “all-or-nothing” statement about the poll results: the real population proportion might well lie outside the boundaries of the margin of error, and the confidence probability tells us how (un)likely this is. Its value is derived from probability theory (the Central Limit Theorem).
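The figures quoted above can be checked with the usual normal approximation to a sample proportion. This is a sketch under two assumptions that the text does not state explicitly: the conventional worst-case proportion p = 0.5 is used, and the function names are ours, chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, confidence, p=0.5):
    """Half-width of the normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * sqrt(p * (1 - p) / n)

def confidence_for_margin(n, margin, p=0.5):
    """Confidence probability that a poll of size n achieves for a given margin."""
    z = margin / sqrt(p * (1 - p) / n)
    return 2 * NormalDist().cdf(z) - 1

print(f"n=1068, 95% confidence: margin = {margin_of_error(1068, 0.95):.3f}")
print(f"n=752,  90% confidence: margin = {margin_of_error(752, 0.90):.3f}")
print(f"n=752,  3% margin: confidence = {confidence_for_margin(752, 0.03):.3f}")
```

Both sample sizes give a margin of about 3%, at 95% and 90% confidence respectively, which matches the numbers in the paragraph above.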

Now consider a particular state in the presidential election where the poll returns a result of 48% and 47% with a 3% margin of error. Since the two confidence intervals, (45%, 51%) and (44%, 50%) respectively, overlap, commentators typically call the state a statistical dead heat. When these intervals do not overlap, they usually place that state (with its electoral votes) in a particular candidate’s column, even though, as we have just explained, there is only a 90% or 95% chance that the real population proportion is inside each of these intervals (depending on the sample size).

Our analysis attempts to draw finer distinctions from poll data than simply declaring a state Red, Blue, or Swing. It proceeds in two steps:
  • Step 1: compute the probability that a given candidate wins a given state, given the latest polls for that state, rather than categorizing the state as Red, Blue, or Swing.
  • Step 2: roll up the probabilities thus computed for each of the 50 states plus D.C. to infer the nationwide probability that a given candidate wins a majority of the Electoral College votes (270 or more in the 2004 election).

Step 1

We start by estimating, from polling data for each state, the probability that a given candidate wins that state. We denote this probability Pk for state k. For example, if k corresponds to California, Kerry will collect (all of) California’s 55 electoral votes with probability Pk. If we estimate Pk = 98% from poll data, then we will say that Kerry has a 98% chance of winning the 55 California votes, similar to flipping a loaded coin with a 98% chance of heads. We do this calculation for each of the 50 states and the District of Columbia.
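One simple way to turn a poll into such a probability, offered here only as an illustrative assumption and not as the exact method behind the predictions, is to apply a normal approximation to the lead of one candidate over the other. The function name `state_win_probability` and the treatment of a single poll as the only source of uncertainty are hypothetical simplifications.

```python
from math import sqrt
from statistics import NormalDist

def state_win_probability(share_a, share_b, n):
    """Approximate probability that candidate A truly leads candidate B,
    given poll shares share_a and share_b from a sample of size n.

    Assumption: the lead (share_a - share_b) is treated as normally
    distributed with the multinomial sampling variance of a difference.
    """
    lead = share_a - share_b
    variance = (share_a + share_b - lead ** 2) / n
    return NormalDist().cdf(lead / sqrt(variance))

# The 48% vs. 47% "dead heat" example, assuming a poll of 1,068 voters.
p_k = state_win_probability(0.48, 0.47, 1068)
print(f"P(A leads) = {p_k:.2f}")
```

Under these assumptions the 48% vs. 47% state comes out at roughly 63% for the leading candidate: close, but more informative than the label "dead heat".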

Step 2

Using these computed probabilities, we can devise a procedure to compute the probability that either candidate wins exactly X electoral votes, for any X ranging from zero to 538.

Of course, it is very unlikely that Kerry will win all 538 votes, which would be akin to flipping each of the 51 loaded coins (each with a different probability of turning up heads) and obtaining heads on all of them. Similarly, it is very unlikely for Kerry to lose them all. However, computing the probability that Kerry wins exactly 282 votes, for example, is a daunting task. Imagine all the possible ways that Kerry can get 282 votes: 55 from California, 10 from Wisconsin, 4 from New Hampshire, … It amounts to a laborious book-keeping problem: figuring out all collections of states that give Kerry 282 votes, with heads in those coin flips and tails in the rest. Using a clever procedure, we are able to carry out these computations efficiently without having to explicitly consider all possible combinations of state outcomes.
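A standard way to avoid enumerating every combination is dynamic programming: add one state at a time and update the probability of each running vote total. The sketch below illustrates that general technique; it is not claimed to be the specific procedure used here. It assumes state outcomes are independent (as in the coin-flip picture), and apart from California's 98%, the probabilities in the example are hypothetical.

```python
def electoral_vote_distribution(states):
    """states: list of (electoral_votes, win_probability) pairs.

    Returns a list dist where dist[x] is the probability of winning
    exactly x electoral votes, assuming independent state outcomes.
    """
    total = sum(votes for votes, _ in states)
    dist = [0.0] * (total + 1)
    dist[0] = 1.0  # before any state is counted, the total is certainly 0
    reached = 0    # largest total reachable with the states processed so far
    for votes, p in states:
        new = [0.0] * (total + 1)
        for x in range(reached + 1):
            new[x] += dist[x] * (1 - p)    # state lost: total unchanged
            new[x + votes] += dist[x] * p  # state won: add its votes
        dist = new
        reached += votes
    return dist

# Toy example with three states; only the 0.98 comes from the text.
toy_states = [(55, 0.98), (10, 0.50), (4, 0.60)]
dist = electoral_vote_distribution(toy_states)
print(f"P(all 69 votes) = {dist[69]:.3f}")
print(f"P(exactly 59 votes) = {dist[59]:.3f}")
```

With 51 contests and 538 votes, this takes on the order of 51 × 538 updates, instead of examining 2^51 combinations of state outcomes.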

To win the election, either candidate has to win at least 270 electoral votes (that is, 270, or 271, or 272, …, or 538). The final output of our analysis is the single probability that either candidate wins the majority of the electoral votes, and thus the White House. It is, when all is said and done, the only number we should care about in this election.
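Once the probability of each exact vote total is known, the winning probability is just the sum over all totals at or above the threshold. A minimal sketch follows, with a hypothetical helper `prob_at_least` and a made-up four-entry distribution standing in for the real 539-entry one.

```python
def prob_at_least(dist, threshold):
    """Sum of dist[x] for all x >= threshold, where dist[x] is the
    probability of winning exactly x electoral votes."""
    return sum(dist[threshold:])

# Made-up distribution over totals 0..3, for illustration only.
toy_dist = [0.1, 0.2, 0.3, 0.4]
print(f"P(at least 2 votes) = {prob_at_least(toy_dist, 2):.2f}")

# With a full distribution over 0..538 votes, the headline number would be
# prob_at_least(full_dist, 270).
```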