In this example, we will load some data into R and explore it, at least in a simple sense.
In class, I will sometimes use the IPython notebook to run R, which first has to be enabled with a magic command.
The data ships with an R package called ggplot2. If the package is not installed yet, install it once with install.packages("ggplot2"); then load the package and the dataset with these commands:
library(ggplot2)
data(diamonds)
To find out what information there is about the dataset, you can run this command:
help(diamonds)
To find a more numeric summary of the data, try
summary(diamonds)
##      carat             cut          color       clarity
##  Min.   :0.200   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.400   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.700   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.798   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.040   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.010                     I: 5422   VVS1   : 3655
##                                    J: 2808   (Other): 2531
##     depth          table           price             x
##  Min.   :43.0   Min.   :43.0   Min.   :  326   Min.   : 0.00
##  1st Qu.:61.0   1st Qu.:56.0   1st Qu.:  950   1st Qu.: 4.71
##  Median :61.8   Median :57.0   Median : 2401   Median : 5.70
##  Mean   :61.8   Mean   :57.5   Mean   : 3933   Mean   : 5.73
##  3rd Qu.:62.5   3rd Qu.:59.0   3rd Qu.: 5324   3rd Qu.: 6.54
##  Max.   :79.0   Max.   :95.0   Max.   :18823   Max.   :10.74
##
##        y               z
##  Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 4.72   1st Qu.: 2.91
##  Median : 5.71   Median : 3.53
##  Mean   : 5.73   Mean   : 3.54
##  3rd Qu.: 6.54   3rd Qu.: 4.04
##  Max.   :58.90   Max.   :31.80
To view another textual summary, try
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
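The Ord.factor entries in the str() output are ordered factors: R stores each value as an integer code that points into a sorted list of levels. Here is a minimal sketch of that idea. It uses a small hand-made vector instead of the full dataset; the names cut_levels and cuts are hypothetical, and the six grades were chosen so that their codes match the first six codes printed for cut above.

```r
# The five cut grades, from worst to best, as listed in summary(diamonds).
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# A toy ordered factor (hypothetical example data, not the real column).
cuts <- factor(c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
               levels = cut_levels, ordered = TRUE)

levels(cuts)       # the level labels, in order
as.integer(cuts)   # the integer codes that str() prints: 5 4 2 4 2 3
cuts > "Good"      # ordered factors can be compared against a level
table(cuts)        # count of values per level, including empty levels
```

Because the factor is ordered, comparisons such as cuts > "Good" are meaningful; with an unordered factor R would return NA with a warning.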
To peek at the first few rows of the data, try
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Or, if you want the 10th through 20th rows (inclusive):
diamonds[10:20, ]
## carat cut color clarity depth table price x y z
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## 11 0.30 Good J SI1 64.0 55 339 4.25 4.28 2.73
## 12 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 13 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
## 14 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 15 0.20 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 16 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 17 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 18 0.30 Good J SI1 63.4 54 351 4.23 4.29 2.70
## 19 0.30 Good J SI1 63.8 56 351 4.23 4.26 2.71
## 20 0.30 Very Good J SI1 62.7 59 351 4.21 4.27 2.66
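The general pattern is data[rows, columns], and leaving one side of the comma empty means "all of them". The following sketch shows a few variations on a toy data frame so that it runs without ggplot2. The name d is hypothetical, and its three columns are copied from the first six rows printed by head(diamonds).

```r
# Toy data frame built from the first six rows of the head() output.
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

d[2:4, ]       # rows 2 through 4 (inclusive), all columns
d[c(1, 6), ]   # only the first and the sixth row
d[-(1:2), ]    # negative indices drop rows: everything except rows 1 and 2
d[2:4, 1:2]    # a row range and a column range at the same time
```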
You can access variables by name:
summary(diamonds[, c("clarity", "price")])
##     clarity          price
##  SI1    :13065   Min.   :  326
##  VS2    :12258   1st Qu.:  950
##  SI2    : 9194   Median : 2401
##  VS1    : 8171   Mean   : 3933
##  VVS2   : 5066   3rd Qu.: 5324
##  VVS1   : 3655   Max.   :18823
##  (Other): 2531
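There are several ways to refer to a column by name, and they differ in whether you get back a plain vector or a data frame. A short sketch on a toy data frame (d is a hypothetical name; the values come from the first six rows printed by head(diamonds)):

```r
# Toy data frame built from the first six rows of the head() output.
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

names(d)                 # the available column names
d$price                  # one column as a plain vector
d[["price"]]             # the same vector, with the name given as a string
d["price"]               # a one-column data frame
d[, c("cut", "price")]   # several columns, still a data frame
```

The vector forms are what functions like mean() expect, while the data frame forms are convenient for passing to summary() or head().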
We might want a visual summary of some variables as well:
pairs(diamonds[, c("depth", "price")])
Some of the variables are discrete, or categorical. A boxplot shows how price is distributed within each clarity grade:
boxplot(diamonds$price ~ diamonds$clarity)
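A boxplot draws the median and quartiles of one variable within each group of another. The same grouping can be done numerically with tapply(), which may help when reading the plot. This is only a sketch on a toy data frame (d is a hypothetical name; the values come from the first six rows printed by head(diamonds)), so the numbers say nothing about the full dataset.

```r
# Toy data frame built from the first six rows of the head() output.
d <- data.frame(
  clarity = c("SI2", "SI1", "VS1", "VS2", "SI2", "VVS2"),
  price   = c(326L, 326L, 327L, 334L, 335L, 336L)
)

# Median price within each clarity grade (the thick line in each box).
med <- tapply(d$price, d$clarity, median)
med

# How many rows fall into each grade.
counts <- table(d$clarity)
counts
```

On the real data, replacing d with diamonds gives the medians that the boxes in the plot are centered on.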
You may want all rows for diamonds with a price higher than $4000:
diamonds_more_than_4000 = diamonds[diamonds$price > 4000, ]
head(diamonds_more_than_4000)
## carat cut color clarity depth table price x y z
## 6212 1.07 Very Good I SI1 58.4 60 4001 6.68 6.78 3.93
## 6213 0.90 Ideal G SI1 61.6 57 4001 6.17 6.24 3.82
## 6214 0.90 Ideal H SI2 62.1 55 4001 6.17 6.20 3.84
## 6215 1.03 Good G SI2 63.7 60 4001 6.35 6.28 4.02
## 6216 0.80 Very Good G VVS2 62.5 56 4002 5.95 5.98 3.73
## 6217 0.99 Very Good J SI1 60.3 57 4002 6.44 6.49 3.90
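The expression inside the brackets is just a logical vector with one TRUE or FALSE per row, and R keeps the rows marked TRUE. The sketch below makes that intermediate step visible on a toy data frame. The names d and expensive are hypothetical, the values come from the first six rows printed by head(diamonds), and the threshold of 330 is chosen only because every toy price is far below 4000.

```r
# Toy data frame built from the first six rows of the head() output.
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

expensive <- d$price > 330   # logical vector: FALSE FALSE FALSE TRUE TRUE TRUE
sum(expensive)               # TRUE counts as 1, so this is the number of matches
d[expensive, ]               # keep only the rows where the condition is TRUE

# Conditions can be combined with & (and) or | (or).
d[d$price > 330 & d$cut == "Good", ]

# subset() is a base R shorthand for filtering rows and picking columns.
subset(d, price > 330, select = c(cut, price))
```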
To extract only the color, clarity, and price of these diamonds:
color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c("color", "clarity",
"price")]
head(color_clarity_more_than_4000)
## color clarity price
## 6212 I SI1 4001
## 6213 G SI1 4001
## 6214 H SI2 4001
## 6215 G SI2 4001
## 6216 G VVS2 4002
## 6217 J SI1 4002
Or, noting that color and clarity are the 3rd and 4th columns and price is the 7th, we can select the same data by column position:
color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c(3, 4, 7)]
head(color_clarity_more_than_4000)
## color clarity price
## 6212 I SI1 4001
## 6213 G SI1 4001
## 6214 H SI2 4001
## 6215 G SI2 4001
## 6216 G VVS2 4002
## 6217 J SI1 4002