First, we will cluster using “complete” linkage, which measures the dissimilarity between two clusters as the maximum dissimilarity between any pair of points, one from each cluster.
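# Read the iris data from the UCI repository (the file has no header row)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                  sep = ",", header = FALSE)
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
                "iris.type")
# Drop the label column; dist() gives Euclidean distances and
# hclust() uses method = "complete" by default
iris_hclust = hclust(dist(iris[, -5]))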
plot(iris_hclust)
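To make the complete linkage rule concrete, the following sketch uses a small made-up data set (the matrix `toy` and the groups `a` and `b` are illustrative assumptions, not part of the iris analysis). It computes the largest pairwise distance between the two groups by hand and compares it with the height of the final merge reported by `hclust`.

```r
# Toy data: two well-separated groups on a line (hypothetical values)
toy = matrix(c(0, 1, 2, 10, 11, 13), ncol = 1)
a = 1:3   # rows of the first group
b = 4:6   # rows of the second group

# Full matrix of pairwise Euclidean distances
d_mat = as.matrix(dist(toy))

# Complete linkage dissimilarity: the largest distance between
# a point in group a and a point in group b
complete_ab = max(d_mat[a, b])

# hclust() uses method = "complete" by default; for this toy data
# the last merge joins the two groups
toy_hclust = hclust(dist(toy))
top_height = max(toy_hclust$height)

print(c(by_hand = complete_ab, hclust = top_height))
```

The two numbers should agree, since the height of a merge in the dendrogram is the linkage dissimilarity between the two clusters being joined.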
We can cut the tree and look at the resulting clustering. Let’s cut it
into the canonical 3 groups. We see the results are quite similar to the
k-means and mixture model results.
iris_3 = cutree(iris_hclust, k = 3)
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])
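One way to check how well the three clusters line up with the species labels is a cross-tabulation. The sketch below is self-contained, so it assumes R's built-in iris data (accessed as `datasets::iris`, with a `Species` column) as a stand-in for the downloaded file; the names `iris_builtin`, `hc`, `groups`, and `confusion` are illustrative.

```r
# Assumption: the built-in iris data stands in for the UCI download
iris_builtin = datasets::iris

# Complete linkage clustering on the four measurements, cut into 3 groups
hc = hclust(dist(iris_builtin[, 1:4]), method = "complete")
groups = cutree(hc, k = 3)

# Rows are the true species, columns are the cluster labels
confusion = table(species = iris_builtin$Species, cluster = groups)
print(confusion)
```

Counts concentrated in one cell per row indicate close agreement with the species labels. The cluster numbers themselves are arbitrary, so only the pattern of the table matters.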
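From the first plot, the three groups correspond to a height of roughly 4, perhaps a little less. We can therefore also cut the tree by height. With complete linkage, cutting at a height of about 4 means that no two observations in the same cluster are more than about 4 apart, and that any two of the resulting clusters have a maximum dissimilarity greater than that.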
iris_h = cutree(iris_hclust, h = 3.9)
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_h])
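The relationship between the cut height and the resulting clusters can be checked directly. The following sketch again assumes the built-in `datasets::iris` as a stand-in for the downloaded file, and reuses the cut height from above under the illustrative name `h_cut`. For complete linkage, the largest distance between two observations in the same cluster should not exceed the cut height.

```r
# Assumption: the built-in iris data stands in for the UCI download
iris_builtin = datasets::iris
d = dist(iris_builtin[, 1:4])
hc = hclust(d, method = "complete")

h_cut = 3.9
groups = cutree(hc, h = h_cut)

# Number of clusters produced by this cut
n_groups = length(unique(groups))

# Largest within-cluster distance (the cluster "diameter") for each cluster
d_mat = as.matrix(d)
diameters = sapply(sort(unique(groups)), function(g) {
  members = which(groups == g)
  max(d_mat[members, members])
})

print(n_groups)
print(diameters)
```

The number of clusters equals one plus the number of merges that occur above the cut height, which is why cutting by `h` and cutting by `k` can describe the same partition.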
We can also ask for a finer clustering, for example 6 groups.
iris_6 = cutree(iris_hclust, k = 6)
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green",
     "yellow", "orange", "purple")[iris_6])
Single linkage instead uses the minimum dissimilarity between any pair of points, one from each cluster.
iris_hclust_single = hclust(dist(iris[, -5]), method = "single")
plot(iris_hclust_single)
This plot has the prototypical “chaining” seen in single linkage. Cutting it into 3 groups gives a split that includes one large group and a very small group of size 2.
iris_3 = cutree(iris_hclust_single, k = 3)
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])
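Chaining is easiest to see on a small artificial example. In the sketch below, `chain` is a hypothetical set of equally spaced points on a line: single linkage joins everything at the nearest-neighbour spacing, whereas complete linkage has to climb to the full spread of the data before the last merge.

```r
# Hypothetical data: six equally spaced points on a line
chain = matrix(0:5, ncol = 1)
d_chain = dist(chain)

single_heights = hclust(d_chain, method = "single")$height
complete_heights = hclust(d_chain, method = "complete")$height

# Single linkage: every merge happens at the nearest-neighbour gap
print(single_heights)
# Complete linkage: merge heights grow with the cluster diameters
print(complete_heights)
```

Because single linkage only needs one close pair to join two clusters, a string of nearby points can pull otherwise distant observations into one long cluster, which is the pattern visible in the dendrogram above.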
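Using method = "average" yields the average linkage tree, which measures the dissimilarity between two clusters as the average of all pairwise dissimilarities between their points. Its results are usually somewhat intermediate between those of complete and single linkage.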
iris_hclust_average = hclust(dist(iris[, -5]), method = "average")
plot(iris_hclust_average)
iris_3 = cutree(iris_hclust_average, k = 3)
plot(iris$sepal.length, iris$sepal.width, pch = 23, bg = c("red", "blue", "green")[iris_3])
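To see why average linkage tends to sit between the other two, it can help to compute all three between-cluster dissimilarities for one fixed pair of groups. This sketch assumes the built-in `datasets::iris` as a stand-in for the downloaded file and uses two species as the groups purely for illustration; the names `a`, `b`, `cross`, and `linkage` are hypothetical.

```r
# Assumption: the built-in iris data stands in for the UCI download
iris_builtin = datasets::iris
d_mat = as.matrix(dist(iris_builtin[, 1:4]))

# Two fixed groups of observations (chosen for illustration only)
a = which(iris_builtin$Species == "setosa")
b = which(iris_builtin$Species == "versicolor")

# All pairwise distances between a point in a and a point in b
cross = d_mat[a, b]

# The three linkage dissimilarities for this pair of groups
linkage = c(single = min(cross), average = mean(cross), complete = max(cross))
print(linkage)
```

For any fixed pair of groups the minimum is no larger than the mean, which in turn is no larger than the maximum, so the average linkage dissimilarity always falls between the single and complete linkage values for that pair.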