In this example, we look at several aspects of the voting records of the US Congress of 2010. The votes were compiled from http://clerk.house.gov.
We have the votes (or absence of a vote) for each member of the House of Representatives on over 939 roll call votes in 2011 with 426 members having all votes. Not surprisingly, these votes are clearly structured across party lines. We will apply PCA and see that the first component pretty much determines party affiliation.
I generated the data file 2011_cleaned_votes.csv using the following function that scrapes the votes for a given roll call in a given year.
In[1]:
import urllib, os.path, time
from elementtree import ElementTree as ET
import numpy as np
In[2]:
def scrape_votes(year, roll, sleep=True):
"""
For a given year and roll call number, download the votes
as an XML file, retrieve all votes and return an array that
has a row for each member of the House, their ID, their party
and their vote, represented as a string in the "vote" field
and a numeric representation in their "numeric_vote" field.
An "Aye" was coded as 1, "Nay" as -1, with "Not Voting" and
"Present" coded as 0.
"""
# download the data and make a local copy
if not os.path.exists("../vote/%s_%s.xml" % (year, roll)):
if sleep:
time.sleep(np.random.randint(2,4))
data = urllib.urlopen("http://clerk.house.gov/evs/%d/roll%03d.xml" % (year, roll)).read()
f = file("../vote/%s_%s.xml" % (year, roll), 'w')
f.write(data)
f.close()
data = file("../vote/%s_%s.xml" % (year, roll))
# parse the XML tree
# an investigation of one of the .xml files shows that
# the information we want is in the <recorded-vote> tags
xml_data = ET.parse(data)
values = []
for vote in xml_data.find('vote-data').findall('recorded-vote'):
legislator = vote.find('legislator')
name = legislator.get('unaccented-name')
ident = legislator.get('name-id')
party = legislator.get('party')
v = vote.find('vote').text
values.append((name, v, party, ident, {'Yea':1,'Aye':1, 'Nay':-1, 'Not Voting':0, 'Present':0, 'No':-1}[v]))
return np.array(values, np.dtype([('name', '|S50'),
('vote%d' % roll, '|S12'),
('party', '|S15'),
('id', '|S12'),
('numeric_vote%d' % roll, np.int)]))
Here’s a sample of what this function returns.
In[3]:
scrape_votes(2011,21)[:20]
Out[3]:
array([('Ackerman', 'Yea', 'D', 'A000022', 1),
('Adams', 'Yea', 'R', 'A000366', 1),
('Aderholt', 'Yea', 'R', 'A000055', 1),
('Akin', 'Yea', 'R', 'A000358', 1),
('Alexander', 'Yea', 'R', 'A000361', 1),
('Altmire', 'Yea', 'D', 'A000362', 1),
('Amash', 'Yea', 'R', 'A000367', 1),
('Andrews', 'Yea', 'D', 'A000210', 1),
('Austria', 'Yea', 'R', 'A000365', 1),
('Baca', 'Yea', 'D', 'B001234', 1),
('Bachmann', 'Yea', 'R', 'B001256', 1),
('Bachus', 'Yea', 'R', 'B000013', 1),
('Baldwin', 'Yea', 'D', 'B001230', 1),
('Barletta', 'Yea', 'R', 'B001269', 1),
('Barrow', 'Yea', 'D', 'B001252', 1),
('Bartlett', 'Yea', 'R', 'B000208', 1),
('Barton (TX)', 'Yea', 'R', 'B000213', 1),
('Bass (CA)', 'Yea', 'D', 'B001270', 1),
('Bass (NH)', 'Yea', 'R', 'B000220', 1),
('Becerra', 'Yea', 'D', 'B000287', 1)],
dtype=[('name', '|S50'), ('vote21', '|S12'), ('party', '|S15'), ('id', '|S12'), ('numeric_vote21', '<i4')])
We will get all votes for 2011 using a simple loops (with a little sleep so we don’t get kicked out of the server).
In[4]:
import matplotlib.mlab as ML # for recarray utils
def get_all_votes():
data = scrape_votes(2011, 1)
rolls_found = [1]
for i in range(2,950):
try:
new_votes = scrape_votes(2011,i)
newdata = ML.rec_join('id', data, ML.rec_drop_fields(new_votes,
['name', 'party']))
if newdata.shape[0] < 10:
# A merge seems to have failed if this happens...
raise ValueError
else:
data = newdata
rolls_found.append(i)
except:
# Roll 2 is election of the speaker... not a Y/N vote
pass
# every 100th vote, let's save what we have to disk
if i % 100 == 0:
ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')
# we don't need all columns to analyse the data
cleaned = ML.rec_keep_fields(data, ['party'] + [n for n in data.dtype.names if 'numeric_vote' in n])
ML.rec2csv(cleaned, '../data/2011_cleaned_votes.csv', delimiter=';')
ML.rec2csv(data, '../data/2011_votes.csv', delimiter=';')
force = False
if not os.path.exists('../data/2011_cleaned_votes.csv') or force:
get_all_votes()
I will run my R analysis from within ipython, though all the R code is exectuable in R itself, of course. The ipython notebook allows me to keep some python preprocessing along with R code in one file...
votes = read.table("http://stats202.stanford.edu/data/2011_cleaned_votes.csv",
header = TRUE, sep = ";")
full_votes = read.table("http://stats202.stanford.edu/data/2011_votes.csv",
header = TRUE, sep = ";")
dim(votes)
## [1] 426 948
summary(votes$party)
## D R
## 189 237
Principal components analysis is an unsupervised technique that yields some “interesting” linear combinations of the votes. However, as implemented in R, it only works when there are fewer votes, than members. Let’s take a random sample of 20 votes and look at the result.
set.seed(0) # this can be changed
subset = sample(2:ncol(votes), 20)
pca.votes = princomp(votes[, subset])
plot(pca.votes$scores[, 1], pca.votes$scores[, 2])
This data is also “supervised”, in the sense that we have a label: the member’s party. Let’s color our points with this label.
red = votes$party == "R"
blue = votes$party == "D"
plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n", xlab = "Variable 1",
ylab = "Variable 2")
points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")
Clearly, there is some separation between the two parties along Variable 1. However, there does not appear to be much separation between the parties on Variable 2.
We can also take a larger sample of votes, and the picture looks similar.
subset = sample(2:ncol(votes), 100)
pca.votes = princomp(votes[, subset])
plot(pca.votes$scores[, 1], pca.votes$scores[, 2], type = "n", xlab = "Variable 1",
ylab = "Variable 2")
points(pca.votes$scores[red, 1], pca.votes$scores[red, 2], pch = 23, bg = "red")
points(pca.votes$scores[blue, 1], pca.votes$scores[blue, 2], pch = 23, bg = "blue")
In summary, Variable 1 seems highly associated to party. There are more than 2 scores. In fact, in this example, there are 100 scores. Each score “explains” some amount of the variability in the votes dataset. How much it explains can be visualized with what is known as a screeplot.
plot(pca.votes)
summary(pca.votes$scores[red, 1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.03 6.13 6.60 6.41 7.02 7.42
summary(pca.votes$scores[blue, 1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.10 -9.24 -8.88 -8.04 -7.99 2.54
Most of the scores for each party are on one side of 0. This task of creating a new variable from several votes is an example of dimension reduction or feature extraction and is an unsupervised task. Often, this is used as a preprocessing to be fed into a supervised task like predicting party labels based on voting records. Note that we did not use the party variable to find these new variables (this is very important).
sign_R = sign(mean(pca.votes$scores[red, 1]))
predictions = numeric(length(votes$party))
predictions[sign(pca.votes$scores[, 1]) == sign_R] = "R"
predictions[sign(pca.votes$scores[, 1]) != sign_R] = "D"
How good is our classifier?
sum(predictions != votes$party)
## [1] 4
It looks pretty good. To be fair, we should probably fit the classifier on a subset of the members and then validate it on the remaining members. We do this for logistic regression below.
In any case, given our simple classifier, we can see how many people “cross party lines” on this variable:
which(sign(pca.votes$scores[blue, 1]) == sign_R)
## [1] 17 117 135 148
which(sign(pca.votes$scores[red, 1]) != sign_R)
## integer(0)
So, we see that 1 Democrat had a “Republican” score on the first variable. Which are they?
full_votes$name[17]
## [1] Bishop (GA)
## 426 Levels: Ackerman Adams Aderholt Akin Alexander Altmire ... Young (IN)
Later in the class, we will discuss classifiers, a simple one being logistic regression. Let’s fit the logistic regression model to our data here and see what it tells us.
party_N = numeric(length(votes$party))
party_N[red] = 1
party_N[blue] = 0
variable = pca.votes$scores[, 1]
logistic.model = glm(votes$party ~ variable, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
How well did the model fit? Well, the logistic model estimates probabilities between 0 or 1. We can turn these into 0-1 labels by the rule: assign 1 if the probability is > 0.5, 0 otherwise.
predictions_logistic = fitted(logistic.model) > 0.5
sum(predictions_logistic != party_N)
## [1] 2
To be fair, we are testing our classifier using the same features we used to extract our feature. Let’s take a subset of members before we do PCA, and then see how well our classifier does.
subset_members = sample(1:nrow(votes), 400)
subset_votes = sample(2:ncol(votes), 100)
pca.votes = princomp(votes[subset_members, subset_votes])
We now have to evaluate the variable on new members, this is what the loadings are.
variable = as.matrix(votes[-subset_members, subset_votes]) %*% pca.votes$loadings[,
1]
Our thresholded classifier still does reasonably well
red = votes[subset_members, ]$party == "R"
sign_R = sign(mean(pca.votes$scores[red, 1]))
# the new predictions
predictions = numeric(length(party_N[-subset_members]))
predictions[sign(variable) == 1] = 1
predictions[sign(variable) != 1] = 0
predictions
## [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1
party_N[-subset_members]
## [1] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1
sum(predictions != party_N[-subset_members])
## [1] 0
We can also predict the labels using the logistic model on the remaining 26 using this variable.
logistic.model = glm(votes$party[-subset_members] ~ variable, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictions_logistic = fitted(logistic.model) > 0.5
sum(predictions_logistic != party_N[-subset_members])
## [1] 0
Both models do pretty well. Is this a good thing or a bad thing?