This example is a short introduction to using R for PCA analysis. The data are scores on various olympic decathlon events for 33 athletes. It is contained in the package ade4.
library(ade4)
## Attaching package: 'ade4'
## The following object(s) are masked from 'package:base':
##
## within
data(olympic)
summary(olympic$tab)
## 100 long poid haut
## Min. :10.6 Min. :6.22 Min. :10.3 Min. :1.79
## 1st Qu.:11.0 1st Qu.:7.00 1st Qu.:13.2 1st Qu.:1.94
## Median :11.2 Median :7.09 Median :14.1 Median :1.97
## Mean :11.2 Mean :7.13 Mean :14.0 Mean :1.98
## 3rd Qu.:11.4 3rd Qu.:7.37 3rd Qu.:15.0 3rd Qu.:2.03
## Max. :11.6 Max. :7.72 Max. :16.6 Max. :2.27
## 400 110 disq perc
## Min. :47.4 Min. :14.2 Min. :34.4 Min. :4.00
## 1st Qu.:48.3 1st Qu.:14.7 1st Qu.:39.1 1st Qu.:4.60
## Median :49.1 Median :15.0 Median :42.3 Median :4.70
## Mean :49.3 Mean :15.0 Mean :42.4 Mean :4.74
## 3rd Qu.:50.0 3rd Qu.:15.4 3rd Qu.:44.8 3rd Qu.:4.90
## Max. :51.3 Max. :16.2 Max. :50.7 Max. :5.70
## jave 1500
## Min. :49.5 Min. :257
## 1st Qu.:55.4 1st Qu.:266
## Median :59.5 Median :272
## Mean :59.4 Mean :276
## 3rd Qu.:64.0 3rd Qu.:286
## Max. :72.6 Max. :303
The biplot gives a graphical summary of both cases (athletes) in terms of scores and the variables in terms of loadings.
pca.olympic = princomp(olympic$tab)
biplot(pca.olympic)
From this plot, we see that the first principal component is positively associated with longer times on the 1500. So, slower runners will have higher value on this component.
data.frame(olympic$tab[, "1500"], pca.olympic$scores[, 1])
## olympic.tab....1500.. pca.olympic.scores...1.
## 1 268.9 -6.074
## 2 273.0 -2.678
## 3 263.2 -12.350
## 4 285.1 9.527
## 5 256.6 -19.567
## 6 274.1 -2.312
## 7 291.2 15.048
## 8 265.9 -10.023
## 9 269.6 -5.977
## 10 292.2 16.717
## 11 295.9 20.036
## 12 256.7 -18.720
## 13 257.9 -18.561
## 14 269.0 -6.525
## 15 267.5 -8.326
## 16 268.5 -7.400
## 17 302.4 27.956
## 18 286.0 10.557
## 19 262.4 -14.220
## 20 277.8 2.264
## 21 266.4 -9.508
## 22 262.9 -13.229
## 23 272.7 -2.996
## 24 277.8 1.099
## 25 285.6 8.574
## 26 270.1 -6.798
## 27 261.9 -15.095
## 28 303.2 27.202
## 29 272.1 -4.895
## 30 293.9 17.291
## 31 295.0 19.235
## 32 293.7 16.966
## 33 270.0 -7.218
plot(olympic$tab[, "1500"], pca.olympic$scores[, 1], pch = 23, bg = "red", cex = 2)
Also, the second principal component is associated with strength in the form of a long javelin throw.
plot(olympic$tab[, "jave"], pca.olympic$scores[, 2], pch = 23, bg = "red", cex = 2)
Standardizing
In the previous example, we saw that the two variables were based somewhat on speed and strength. However, we did not scale the variables so the 1500 has much more weight than the 400, for instance. We correct this by passing the cor=TRUE argument, which defaults to FALSE, as an argument to princomp.
pca.olympic = princomp(olympic$tab, cor = TRUE)
biplot(pca.olympic)
This plot reinforces our earlier interpretation and has put the running events on an “even playing field” by standardizing.
pca.olympic$loadings[, 1]
## 100 long poid haut 400 110 disq perc jave
## 0.4159 -0.3941 -0.2691 -0.2123 0.3558 0.4335 -0.1758 -0.3841 -0.1799
## 1500
## 0.1701
pca.olympic$loadings[, 2]
## 100 long poid haut 400 110 disq perc
## 0.14881 -0.15208 0.48354 0.02790 0.35216 0.06957 0.50333 0.14958
## jave 1500
## 0.37196 0.42097
Since the first variable seems related to speed, we should not be surprised that an increase in this variable decreases the overall decathlon score.
plot(pca.olympic$scores[, 1], olympic$score, xlab = "Comp. 1", pch = 23, bg = "red",
cex = 2)