This example involves various visual summaries of a data set.
The data set we will use is Fisher’s famous iris data set, which we can find at the UCI machine learning database site.
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
sep = ",", header = FALSE)
head(iris)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
This data file does not have variables names, so we attach them to the data as follows.
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width",
"iris.type")
This plot is an ASCII graphic representation of a sample of number’s first few significant digits.
stem(iris$petal.width)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 1 | 000000
## 2 | 0000000000000000000000000000
## 3 | 0000000
## 4 | 0000000
## 5 | 0
## 6 | 0
## 7 |
## 8 |
## 9 |
## 10 | 0000000
## 11 | 000
## 12 | 00000
## 13 | 0000000000000
## 14 | 00000000
## 15 | 000000000000
## 16 | 0000
## 17 | 00
## 18 | 000000000000
## 19 | 00000
## 20 | 000000
## 21 | 000000
## 22 | 000
## 23 | 00000000
## 24 | 000
## 25 | 000
Our first plot will be some histograms. A histogram represents counts within given intervals by the height of the bars. The following histogram has 10 equally spaced breaks, so 11 bins.
hist(iris$petal.width, breaks = 10)
We can try to get finer information by adding more breaks, though the information is less accurate because of fewer data points in each bin.
hist(iris$petal.width, breaks = 20)
If we don’t want to use R‘s automatically chosen breaks, we can supply our own. We can also add color to the plot
seq(0, 2.5, length = 15)
## [1] 0.0000 0.1786 0.3571 0.5357 0.7143 0.8929 1.0714 1.2500 1.4286 1.6071
## [11] 1.7857 1.9643 2.1429 2.3214 2.5000
hist(iris$petal.width, breaks = seq(0, 2.5, length = 15), col = "blue")
A density is a smoothed version of a histogram. Formally, it is a non-negative function that integrates to 1. The area under the curve in any given interval is an estimate of the probability that that variable will fall into that interval.
R has built in tools for estimating density. We will estimate the density for petal.length among the three different types of irises.
density.setosa = density(iris$petal.length[iris$iris.type == "Iris-setosa"])
density.versicolor = density(iris$petal.length[iris$iris.type == "Iris-versicolor"])
density.virginica = density(iris$petal.length[iris$iris.type == "Iris-virginica"])
Now, we will plot all densities, by first plotting one density, then adding the remaining ones with lines
plot(density.setosa, ylab = "f(length)", xlab = "length (cm)", col = "red",
main = "Density for petal lengths", xlim = c(0, 8), lwd = 4)
lines(density.versicolor, col = "blue", lwd = 4)
lines(density.virginica, col = "green", lwd = 4)
legend(4, 2.5, c("Setosa", "Virginica", "Versicolor"), lwd = rep(4, 3), col = c("red",
"green", "blue"))
We see from this plot that Setosa has the smallest petal lengths.
The cumulative area under a density is called a distribution function. Any sample of numbers has a natural distribution function, known as the ECDF (Empirical Cumulative Distribution Function). Note that it is not the cumulative area under the previous densities, it is rougher.
ecdf.setosa = ecdf(iris$petal.length[iris$iris.type == "Iris-setosa"])
ecdf.versicolor = ecdf(iris$petal.length[iris$iris.type == "Iris-versicolor"])
ecdf.virginica = ecdf(iris$petal.length[iris$iris.type == "Iris-virginica"])
Let’s plot these ECDFs, which are always step functions. It has the same information as the density: we see that the red curve is always greater than or equal to the blue or green curves. This means that the irises in the red group (setosa) have the smallest petal lengths.
plot(ecdf.setosa, ylab = "F(length)", xlab = "length (cm)", col.h = "red", main = "ECDF for petal lengths",
verticals = T, col.v = "red", do.p = F, lwd = 4, xlim = c(0, 8))
lines(ecdf.versicolor, col.h = "blue", col.v = "blue", verticals = T, lwd = 4,
do.p = F)
lines(ecdf.virginica, col.h = "green", col.v = "green", verticals = T, lwd = 4,
do.p = F)
The inverse of the ECDF is the quantile function. Plotting the quantile function displays all quantiles of the data sets in a visual fashion.
ps = seq(0, 1, length = 20)
quantile.setosa = quantile(iris$petal.length[iris$iris.type == "Iris-setosa"],
probs = ps)
quantile.setosa
## 0% 5.263% 10.53% 15.79% 21.05% 26.32% 31.58% 36.84% 42.11% 47.37%
## 1.000 1.200 1.300 1.300 1.332 1.400 1.400 1.400 1.400 1.500
## 52.63% 57.89% 63.16% 68.42% 73.68% 78.95% 84.21% 89.47% 94.74% 100%
## 1.500 1.500 1.500 1.500 1.511 1.600 1.600 1.684 1.700 1.900
quantile.versicolor = quantile(iris$petal.length[iris$iris.type == "Iris-versicolor"],
probs = ps)
quantile.virginica = quantile(iris$petal.length[iris$iris.type == "Iris-virginica"],
probs = ps)
This plot will tell us the same information all the other plots did, that the iris types are separated by petal length.
plot(ps, quantile.setosa, xlab = "p", ylab = "length (cm)", col = "red", main = "Quantiles for petal lengths",
type = "l", lwd = 4, ylim = c(0, 8))
lines(ps, quantile.versicolor, col = "blue", lwd = 4)
lines(ps, quantile.virginica, col = "green", lwd = 4)
We already saw boxplots on Homework 1. Here is an example that shows box plots of each column in the data set.
boxplot(iris[, -5])
As in Homework 1, we can create a set of boxplots to show how a given variable like petal.length varies with iris.type.
boxplot(iris$petal.length ~ iris$iris.type)
Another useful summary of the data is a set of pairwise plots of all the continuous variables. Pairwise plots when one or more of the variables are discrete are more difficult to interpret – boxplots would be better in this case, or a table if both variables are discrete.
pairs(as.matrix(iris[, -5]), pch = 21, bg = c("red", "green", "blue")[unclass(iris$iris.type)])
Density estimates can also be produced for higher dimensional data. Here is an example of a 2-dimensional kernel density estimate of the petal lengths and widths. The 2D density estimate is in the MASS package.
library(MASS)
petal.dens = kde2d(iris$petal.length, iris$petal.width)
contour(petal.dens)
We can also view it as an image.
image(petal.dens)
Here is an example that shows the case-by-case correlation matrix of the iris data. The blocky structure of this matrix reflects the fact that the cases are broken into three groups and the rows were ordered by these three groups.
library(fields)
cI = cor(t(iris[, -5]))
plot(c(0, 1.2), c(0, 1), type = "n", bty = "n", xlab = "", ylab = "")
image(cI, col = tim.colors(20), add = TRUE)
colorbar.plot(1.1, 0.5, seq(min(cI), max(cI), length = 51), col = tim.colors(50),
adj.x = 0, horizontal = FALSE)
text(1.15, 0.2, round(min(cI), 2))
text(1.15, 0.8, max(cI))