We will apply
-means clustering to the NCI data, which is the
data used for the hierarchical cluster we saw last class. This plot
shows the within cluster sum of squares as a function of the number of
clusters. To estimate the variability, we used 5 different random
initial data points to initialize K-means. As you can see, it is
typically descreasing, and there is some variability.
In[2]:
%load_ext rmagic
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import vq
Given a data matrix, a set of centroids associated to some partition labels, we want to compute the within cluster sum of squares:
In[3]:
def within_sum_of_squares(data, centroids, labels):
"""
Compute total sum of squares of a prototype clustering
algorithm.
"""
SSW = 0
for l in np.unique(labels):
data_l = data[labels == l]
resid = data_l - centroids[l]
SSW += (resid**2).sum()
return SSW
%R -o nci -n library(ElemStatLearn); data(nci)
In[4]:
nci.shape;
Let’s form a number of clustering of the samples of different sizes, storing the within cluster sum of squares
In[14]:
niter, kmax = 5, 25
SSW = []
for k in range(1,kmax):
SSW.append([within_sum_of_squares(nci.T, *vq.kmeans2(nci.T, k=k, minit='points')) for _ in range(niter)])
SSW = np.array(SSW)
In[15]:
plt.errorbar(range(1,kmax), SSW.mean(1), yerr=SSW.std(1), fmt='ro')
Out[15]:
<Container object of 3 artists>
<matplotlib.figure.Figure at 0x90b12b0>
We can also try to see how stable the clustering is on the iris data by repeating it several times. I found a random instance that split the Iris-setosa cluster and have saved the seed. There appear to be a few configurations that it chooses, in any case, there are definitely more than one.
set.seed(3)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
sep = ",", header = FALSE)
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
"iris.type")
par(mfrow = c(2, 2))
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[kmeans(iris[,
-5], 3)$cluster])
The
-medoid algorithm is implemented in the pam function in
the cluster package in R. PAM stands for: Partitioning Around
Medoid. It seems to be more robust than
-means in the sense
that for the iris data, it almost never split the Iris-setosa
cluster into 2 groups. The call to sample is just to shuffle the
rows of the dataset to ensure that nothing special is happening based on
the ordering of the rows.
library(cluster)
par(mfrow = c(2, 2))
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[pam(iris[s, -5], 3)$cluster])
Finally, the pam function also allows us to produce a silhouette plot
plot(pam(iris[, -5], 3), which.plots = 2)
As a function of the number of clusters, the average silhouette width reaches its best value at 2 for the iris data. Unfortunately, the silhouette method cannot select 1 cluster.
ASW = numeric(19)
k = 2:20
for (kk in 2:20) {
ASW[kk - 1] = pam(iris[, -5], kk)$silinfo$avg.width
}
plot(k, ASW, pch = 23, bg = "red")
A softer version of
-means (or
-medoids) can be found
using a Gaussian mixture model. Rather than assigning each point to a
cluster, a mixture model assigns each point to a cluster with some
probability.
Nevertheless, at the end we can obtain a label by choosing the cluster with highest estimated probability.
library(mclust)
## Warning: package 'mclust' was built under R version 2.15.1
## Package 'mclust' version 4.0
par(mfrow = c(2, 2))
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])
s = sample(150)
plot(iris$sepal.length[s], iris$sepal.width[s], pch = 23, bg = c("red", "blue",
"green")[Mclust(iris[s, -5], G = 3, modelNames = c("VVV"))$classification])