Aquamacs 2.2 and ESS
News of the latest release of Aquamacs, version 2.2, appeared this week in my echo area. Given the opportunity to procrastinate, I dropped everything and upgraded; returning to work, I noticed that the version of ESS shipped with Aquamacs 2.2 is ESS 5.8, released over a year ago. The latest ESS is 5.13, available from http://ess.r-project … ads/ess/ess-5.13.tgz; the easiest way to install is described by Simon Jackman here and elaborated below:
Unarchive ESS and navigate to the folder created; edit Makeconf to set the following:
DESTDIR=/Applications/Aquamacs.app/Contents/Resources/lisp/aquamacs/edit-modes
PREFIX=$(DESTDIR)
EMACS=/Applications/Aquamacs.app/Contents/MacOS/Aquamacs
LISPDIR=$(PREFIX)/ess-mode/lisp
INFODIR=$(PREFIX)/info
ETCDIR=$(PREFIX)/ess-mode/etc
Open Terminal.app, cd to the directory created when you extracted ESS (tip: drag the little folder icon from the top of the Finder window to copy the path). Then gmake install and the updated ESS will overwrite the old version inside the app package. Done! Hopefully this post will save someone the trouble of figuring out where ESS hides deep inside the Aquamacs.app package.
[Update (4/02/11)]: Martin Maechler was kind enough to add the above to the ESS Makeconf; for future versions of ESS, Aquamacs users should simply have to uncomment the appropriate lines in the Makeconf file.
MATLAB / R Reference
Anyone with a MATLAB background interested in transitioning to R is advised to check out this MATLAB / R Reference by Professor David Hiebeler of the University of Maine.
Google Insights and RCurl
Google Insights is nifty. If you’re logged in to your Google account, you can download the results as a CSV file. This is straightforward if you’re using a browser; if you’re trying to retrieve the results of queries using R, however, things get more complicated.
The following code retrieves the results of a Google Insights search for “Sarah Palin” as a data.frame. It uses the RCurl package to do all of the hard work.
username <- "username@gmail.com"
password <- "password_here"

loginURL <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"

require(RCurl)

ch <- getCurlHandle()
curlSetOpt(curl = ch,
           ssl.verifypeer = FALSE,
           useragent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13",
           timeout = 60,
           followlocation = TRUE,
           cookiejar = "./cookies",
           cookiefile = "./cookies")

## do Google Account login
loginPage <- getURL(loginURL, curl = ch)

require(stringr)
galx.match <- str_extract(string = loginPage,
                          pattern = ignore.case('name="GALX"\\s*value="([^"]+)"'))
galx <- str_replace(string = galx.match,
                    pattern = ignore.case('name="GALX"\\s*value="([^"]+)"'),
                    replacement = "\\1")

authenticatePage <- postForm(authenticateURL,
                             .params = list(Email = username, Passwd = password, GALX = galx),
                             curl = ch)

## get Google Insights results CSV
insightsURL <- "http://www.google.com/insights/search/overviewReport"
resultsText <- getForm(insightsURL,
                       .params = list(q = "Sarah Palin", cmpt = "q", content = 1, export = 1),
                       curl = ch)

if(isTRUE(unname(attr(resultsText, "Content-Type")[1] == "text/csv"))) {
    ## got CSV file

    ## create temporary connection from results
    tt <- textConnection(resultsText)
    resultsCSV <- read.csv(tt, header = FALSE)
    ## close connection
    close(tt)
} else {
    ## something went wrong
    ## probably need to log in again?
}
download ‘Google Insights.R’ from gist.github.com
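The GALX extraction step in the script can be tried offline. The following is a minimal sketch using only base R rather than stringr; the login-page fragment and the token value are made up for illustration, and the real page markup may well differ.

```r
## Hypothetical login-page fragment (an assumption, not real Google markup)
loginPage <- '<form><input type="hidden" name="GALX" value="abc123XYZ"></form>'

## Same pattern as in the script; the parenthesised group captures the token
galxPattern <- 'name="GALX"\\s*value="([^"]+)"'

## regexec() reports the full match followed by each captured group
m <- regmatches(loginPage, regexec(galxPattern, loginPage, ignore.case = TRUE))[[1]]

## keep the captured group if the pattern matched, otherwise NA
galx <- if (length(m) == 2) m[2] else NA_character_
print(galx)
```

Checking for NA before posting the login form would make a changed login page easier to diagnose.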
I don’t have much else to say about this, but I hope that it will be helpful to someone.
You can change the query to incorporate geographic restrictions or such by adding the parameters that appear in the URL when you change your search through the Google Insights web search; for instance, a basic search for “QUERY” gives URL http://www.google.com/insights/search/#q=QUERY&cmpt=q whereas the same search restricted to the state of New York has URL http://www.google.com/insights/search/#q=QUERY&geo=US-NY&cmpt=q; the added parameter is “geo=US-NY”. To incorporate this into the script, change
resultsText <- getForm(insightsURL, .params = list(q = "Sarah Palin", cmpt = "q", content = 1, export = 1), curl = ch)
to have the additional parameter in the .params list:
resultsText <- getForm(insightsURL, .params = list(q = "Sarah Palin", cmpt = "q", geo = "US-NY", content = 1, export = 1), curl = ch)
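If you expect to vary the parameters often, one option is to build the .params list with a small helper. This is only a sketch: insightsParams is a hypothetical name, and the only parameters it knows about are the ones shown in this post.

```r
## Hypothetical helper: build the .params list for getForm().
## Extra query parameters (e.g. geo) are passed through "...".
insightsParams <- function(query, ...) {
    c(list(q = query, cmpt = "q"),
      list(...),
      list(content = 1, export = 1))
}

## basic search, and the same search restricted to New York
basicParams <- insightsParams("Sarah Palin")
nyParams <- insightsParams("Sarah Palin", geo = "US-NY")
str(nyParams)
```

The result can then be passed as .params = nyParams in the getForm call.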
[Updated 2012-04-24]
How to buy a used car with R (part 2)
Continued from Part 1.
Part 2: Digging into the Kelley Blue Book
The only thing better than a bit of data is a lot of data. Now that we can grab KBB values for a given trim of a given model in a given year, we set our ambitions higher: automating the collection of these values for all trims of a model over a set of years. To do so, let’s back up and recall how we got to the KBB results page:
Let’s suppose we’re still set on the Honda Accord and are considering the last ten model years. Going with “Search by: Year, Make & Model”, we get to the following self-explanatory screen:
Choosing (2005, Honda, Accord) pushes us to the following address: http://www.kbb.com/used-cars/honda/accord/2005/. There, we are reminded that the KBB reports different values for retail, certified retail, private sellers, and trade-ins:
Let’s go with “Private Party Value” for now; we end up at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value. We’re now presented with a plethora of different trims, enough to make us nostalgic for Henry Ford:
Start with the “DX Sedan 4D”. We arrive at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/equipment?id=846. If the previous screen didn’t freak us out, this one definitely should. But if we ignore the options at the bottom (which are set to their standard values for the given model year and trim), we’re left with the important parameters: the choice of automatic or manual transmission and the mileage (and the ZIP code, which I’ll discuss later).
I can’t drive stick, so I’m not particularly worried about changing the transmission from its default of Automatic. But if you wanted to, note that choosing Automatic with default options and 10,000 miles pushes you to http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&mileage=10000 whereas choosing Manual, 5-Spd with the same options and mileage gives http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&equipment=35014|true&mileage=10000.
Either way, we end up at a completely pointless page: no matter what you select, the results page gives values for all conditions.
Say we select “Good”. The results page for the Automatic is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&mileage=10000 and the results page for the Manual, 5-Spd is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&equipment=35014|true&mileage=10000. If we want, we can tear off the “condition” field, in which case the default condition, Excellent, is highlighted.
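The URL pattern described above can be captured in a small function. This is a sketch under the assumption that the scheme stays exactly as shown in the example URLs; kbbPricingURL and its argument names are hypothetical.

```r
## Hypothetical helper: assemble a pricing-report URL from its parts.
## condition and equipment are optional and are dropped when NULL.
kbbPricingURL <- function(year, id, mileage, condition = NULL, equipment = NULL,
                          prefix = "http://www.kbb.com/used-cars/honda/accord/",
                          type = "private-party-value") {
    ## c() silently drops NULL entries, so only supplied fields remain;
    ## "%d" avoids scientific notation for large numbers such as 100000
    query <- c(condition = condition,
               id = sprintf("%d", as.integer(id)),
               equipment = equipment,
               mileage = sprintf("%d", as.integer(mileage)))
    sprintf("%s%d/%s/pricing-report?%s", prefix, as.integer(year), type,
            paste(names(query), query, sep = "=", collapse = "&"))
}

## Automatic (default equipment) and Manual, 5-Spd in "good" condition
autoURL <- kbbPricingURL(2005, 846, 10000, condition = "good")
manualURL <- kbbPricingURL(2005, 846, 10000, condition = "good",
                           equipment = "35014|true")
cat(autoURL, manualURL, sep = "\n")
```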
So, if we want to grab results for a bunch of different years and trims, we need to figure out the id=846 part of the URL (and possibly the equipment=35014|true part if we’re after a manual transmission). Again, it’s time for Firebug. Back up to the trim selection page at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value and load up Firebug. If we examine the links for the various trims, we see that the links for the available trims are contained within a div with id='UCPathTrim'.
The next step is to write some R code to parse the trim selection page and pull out the available trims and their corresponding id values. This will make use of some of the core functionality of the XML package.
The XML package and HTML documents
In the last post, we used the function readHTMLTable from the XML package to read the results from a webpage into an R data.frame. At the time, there was little mention of the technical details; now, we’re moving beyond convenient functions and into the great unknown.
The XML package, written by Professor Duncan Temple Lang of UC Davis, is a wrapper for libxml2. The package website, hosted by The Omega Project for Statistical Computing, is at http://www.omegahat.org/RSXML/, and the package listing on CRAN is located at http://cran.r-project.org/web/packages/XML/index.html.
At its core, the XML package is meant for parsing XML and HTML documents into tree structures and selecting and extracting or otherwise manipulating branches or nodes of the trees. Take a look at the HTML tab of Firebug again (on http://www.kbb.com/used-cars/honda/accord/2005/private-party-value), and note that the webpage consists of a tree of HTML tags. At its root, there’s a html node, with children head and body; within the body branch are nodes defining the structure of the document, including a branch descending from a div node (<div class="modCBox UCPathModule" id="UCPathTrim">) containing a branch descending from a span node (<span class="sectContent">) with leaf nodes like <a class="link_circle_arrow_blue" href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846"> Accord DX Sedan 4D</a>.
Now, moving to R, we’ll look at the tree produced by the XML package for this document. The first section of code should be fairly straightforward:
## download the webpage
kbbHTML <- readLines("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value")

## load the XML package and parse the downloaded document
require(XML)
kbbTree <- htmlTreeParse(kbbHTML, asText = TRUE)

## get the root ('html') node
kbbRoot <- xmlRoot(kbbTree)
Each node object (class XMLNode) is also a list containing its immediate children as node objects.
> ## print the child nodes ('head' and 'body')
> print(summary(kbbRoot))
Length Class Mode
head 14 XMLNode list
body 19 XMLNode list
Thus, we can get the body of the document:
## select the 'body' child node using the usual R list element extraction syntax
kbbBody <- kbbRoot[["body"]]
Within the body, there’s a bunch of child nodes (the same ones we see in Firebug, of course):
> ## print the child nodes of the 'body'
> print(summary(kbbBody))
Length Class Mode
script 1 XMLNode list
script 1 XMLNode list
div 4 XMLNode list
comment 0 XMLCommentNode list
script 0 XMLNode list
script 1 XMLNode list
script 1 XMLNode list
script 1 XMLNode list
noscript 1 XMLNode list
comment 0 XMLCommentNode list
comment 0 XMLCommentNode list
script 0 XMLNode list
div 2 XMLNode list
script 0 XMLNode list
script 1 XMLNode list
comment 0 XMLCommentNode list
script 1 XMLNode list
noscript 1 XMLNode list
comment 0 XMLCommentNode list
Either by looking at the tree in Firebug or using summaries of the tree in R, we can identify the div node we’re looking for and access the corresponding node object in R:
## select our 'div id="UCPathTrim"...' node; instead of using node
## names (like 'div'), which aren't necessarily unique here, we use
## indices (we want the first child of the first child of the second
## child of the second child of the third child of 'body')
divUCPathTrim <- kbbBody[[3]][[2]][[2]][[1]][[1]]
> ## print the child nodes
> print(summary(divUCPathTrim))
Length Class Mode
h2 1 XMLNode list
text 0 XMLTextNode list
span 9 XMLNode list
We can then access the trim links, which are the leaf nodes of the span node under divUCPathTrim. Printing an XMLNode object outputs the raw HTML.
> ## print the HTML of the first of the link leaf nodes (children of the 'span' node)
> print(divUCPathTrim[["span"]][[1]])
<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846" class="link_circle_arrow_blue">Accord DX Sedan 4D</a>
To get the node contents (here, the trim label), we use the xmlValue function:
> ## print the *contents* of this leaf node
> print(xmlValue(divUCPathTrim[["span"]][[1]]))
[1] "Accord DX Sedan 4D"
To get the link target (the ‘href’ attribute), we use the xmlAttrs function:
> ## print the 'href' attribute of this leaf node
> print(xmlAttrs(divUCPathTrim[["span"]][[1]])[["href"]])
[1] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"
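Since the KBB page may no longer look like this, here is a self-contained sketch of the same list-style navigation on a tiny made-up document. The HTML string and the shortened href values are invented for illustration; only the XML functions already used above are called, plus xmlSize to count children.

```r
library(XML)

## Toy document loosely mimicking the structure described above (an assumption)
toyHTML <- paste0(
    '<html><head><title>Trims</title></head><body>',
    '<div id="UCPathTrim"><h2>Select a trim</h2><span class="sectContent">',
    '<a href="/equipment?id=846">Accord DX Sedan 4D</a>',
    '<a href="/equipment?id=863">Accord EX Coupe 2D</a>',
    '</span></div></body></html>')

## parse into a tree of R objects and walk down by node name
toyRoot <- xmlRoot(htmlTreeParse(toyHTML, asText = TRUE))
toySpan <- toyRoot[["body"]][["div"]][["span"]]

## the links are the children of the span; take the first one
firstLink <- toySpan[[1]]
label <- xmlValue(firstLink)            # node contents
href <- xmlAttrs(firstLink)[["href"]]   # 'href' attribute
nLinks <- xmlSize(toySpan)              # number of child nodes
print(c(label, href))
```

Selecting by name works here only because each name is unique among its siblings; on the real page, indices were needed.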
There’s an easier way to select a set of nodes and apply functions over this set. To do so, we must learn a bit of XPath.
XPath
XPath is a query language for selecting sets of nodes from XML or XML-like documents (like HTML webpages). A nice quick introduction to XPath syntax is the w3schools.com article XPath Syntax. Open it in a tab, read it, and come back.
Done? Good. If we’re super lazy, we can use Firebug to generate an XPath expression to select a given node—just right click on the node and choose “Copy XPath”. Here’s the XPath expression for the second of the nine trim links:
/html/body/div/div[2]/div[2]/div/div/span/a[2]
To select all of the nine trim links, we simply chop off the “[2]” on the end (match all a nodes that are children of that span):
/html/body/div/div[2]/div[2]/div/div/span/a
If we want a short XPath expression, we can instead use something like this:
//div[@id = 'UCPathTrim']//a
That is, we select all a nodes that descend from any div node with attribute id='UCPathTrim'. In XPath syntax, “//nodename” selects descendant nodes named nodename while “/nodename” selects child nodes named nodename (immediate descendants). Using double forward slashes allows us to skip specifying intermediate nodes. Expressions within brackets are conditions, evaluated to booleans, specifying whether a node should or should not be included.
Is there any advantage to using one expression over the other? So long as the structure of the webpage doesn’t change, both will work; however, if the order of the nodes in the document changes, the former expression will fail, but the latter will continue to work (it selects on the div id attribute rather than its position in the document). Similarly, if the div id changes but the document structure otherwise remains unchanged (this is unlikely, but might happen if they messed around with their CSS styling or something), the former would continue working but the latter would fail.
We can create a fancier XPath expression using XPath functions that will continue to work so long as the KBB URL scheme stays the same. Since the rest of the code will depend on this remaining constant, our XPath expression should only fail at the same time as the rest of our code. A list of XPath functions can be found here. We’ll use the function contains(x, y), which returns true if string x contains string y (else false). Our XPath expression is:
//a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
This selects all links with target URLs containing ‘used-cars/honda/accord/2005/private-party-value/equipment’.
getNodeSet and xpathApply
To use XPath with the XML package, we need to parse the document a little differently. You see, the XML package can either parse the document into a tree structure of R objects (as we did above, using htmlTreeParse) or into a tree structure of pointers to C-level objects. In the latter case, the parsed structure is maintained as lower-level objects in memory, and is not immediately accessible in R. Indeed, incorrectly accessing the parsed document object can cause R to crash. However, parsing the document into this C-level structure internal to libxml2 permits the use of XPath expressions. For more, do help("xmlParse").
In practice, using XPath expressions with the XML package is fairly simple. We parse the document with htmlParse instead of htmlTreeParse, and select sets of nodes corresponding to XPath expressions using getNodeSet. We can then lapply or sapply over the resulting nodeset. If we only need to apply a single function, we can instead use xpathApply to apply a function to an XPath-defined set directly.
## parse the downloaded document to an XMLInternalDocument
kbbInternalTree <- htmlParse(kbbHTML, asText = TRUE)

## select nodes matching our XPath expression
xpath.expression <- "//a[contains(@href,'/used-cars/honda/accord/2005/private-party-value/equipment')]"
trim.nodes <- getNodeSet(doc = kbbInternalTree, path = xpath.expression)
> ## the result is of class "XMLNodeSet", a list of 9 externalptr
> ## objects of class "XMLInternalElementNode"
> print(summary(trim.nodes))
Length Class Mode
[1,] 1 XMLInternalElementNode externalptr
[2,] 1 XMLInternalElementNode externalptr
[3,] 1 XMLInternalElementNode externalptr
[4,] 1 XMLInternalElementNode externalptr
[5,] 1 XMLInternalElementNode externalptr
[6,] 1 XMLInternalElementNode externalptr
[7,] 1 XMLInternalElementNode externalptr
[8,] 1 XMLInternalElementNode externalptr
[9,] 1 XMLInternalElementNode externalptr
> ## we can now lapply or sapply over this list object
> print(lapply(trim.nodes, function(x) c(xmlValue(x), xmlAttrs(x)[["href"]])))
[[1]]
[1] " Accord DX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"

[[2]]
[1] " Accord EX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=863"

[[3]]
[1] " Accord EX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=859"

[[4]]
[1] " Accord EX-L Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736"

[[5]]
[1] " Accord EX-L Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263737"

[[6]]
[1] " Accord Hybrid Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=868"

[[7]]
[1] " Accord LX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=856"

[[8]]
[1] " Accord LX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=850"

[[9]]
[1] " Accord LX Special Edition Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=867"
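When only one function needs to be applied, xpathSApply (the simplifying sibling of xpathApply) skips the intermediate node set. The sketch below runs on a made-up HTML string rather than the live KBB page, so the markup and the shortened href values are assumptions.

```r
library(XML)

## Toy document: one unrelated link plus two trim links (invented markup)
toyHTML <- paste0(
    '<html><body><a href="/about">About</a>',
    '<div id="UCPathTrim"><span>',
    '<a href="/equipment?id=846">Accord DX Sedan 4D</a>',
    '<a href="/equipment?id=863">Accord EX Coupe 2D</a>',
    '</span></div></body></html>')

## parse to an internal (C-level) document so XPath can be used
toyDoc <- htmlParse(toyHTML, asText = TRUE)

## select by the div's id attribute and extract the link text
labels <- xpathSApply(toyDoc, "//div[@id='UCPathTrim']//a", xmlValue)

## select by href contents and extract the 'href' attribute
hrefs <- xpathSApply(toyDoc, "//a[contains(@href, '/equipment')]",
                     xmlGetAttr, "href")
print(data.frame(trim = labels, href = hrefs))
```

Both expressions skip the "About" link: the first because it lies outside the div, the second because its href does not contain '/equipment'.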
Putting it all together
I’m getting tired, so let’s jump ahead to a complete function that retrieves all of the trims for a given year. If you’ve read and understood everything above, you should be able to figure out how the function works without much trouble (with the possible exception of the XPath expression, which needlessly uses regular expressions). Go wild with help(...) until it all makes sense.
getKBBYearTrims <- function(prefix, year, type = "private-party-value") {
    require(XML)

    kbbTrimPageURL <- sprintf("%s%i/%s", prefix, year, type)
    cat("Loading", kbbTrimPageURL, "\n")

    x <- readLines(kbbTrimPageURL)
    g <- htmlParse(x, asText=TRUE)

    xpath <- gsub("([http:/w.]+kbb\\.com/)(.*)",
                  "//a[contains(@href, '\\2/equipment')]",
                  kbbTrimPageURL)
    cat("XPath expression is:", xpath, "\n")

    trims <- getNodeSet(doc = g, path = xpath)
    trimlabels <- sapply(trims, xmlValue)
    trimids <- sapply(trims, function(node)
        sub(".*id=([[:digit:]]+)$", "\\1", xmlAttrs(node)[["href"]]))

    trimtable <- data.frame(year = year, trim = trimlabels, id = trimids,
                            stringsAsFactors = FALSE)
    return(trimtable)
}
The function works great for 2005 Accords:
> ## print trims and ids for 2005 Honda Accords
> print(getKBBYearTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", year = 2005))
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
year trim id
1 2005 Accord DX Sedan 4D 846
2 2005 Accord EX Coupe 2D 863
3 2005 Accord EX Sedan 4D 859
4 2005 Accord EX-L Coupe 2D 263736
5 2005 Accord EX-L Sedan 4D 263737
6 2005 Accord Hybrid Sedan 4D 868
7 2005 Accord LX Coupe 2D 856
8 2005 Accord LX Sedan 4D 850
9 2005 Accord LX Special Edition Coupe 2D 867
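The two regular expressions inside getKBBYearTrims can be checked in isolation, without touching the network. This sketch reuses the patterns from the function on a URL and two href values taken from the output above.

```r
## derive the XPath expression from the trim page URL, as the function does
kbbTrimPageURL <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value"
xpath <- gsub("([http:/w.]+kbb\\.com/)(.*)",
              "//a[contains(@href, '\\2/equipment')]",
              kbbTrimPageURL)
print(xpath)

## pull the numeric id off the end of each link target
hrefs <- c("/used-cars/honda/accord/2005/private-party-value/equipment?id=846",
           "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736")
ids <- sub(".*id=([[:digit:]]+)$", "\\1", hrefs)
print(ids)
```

Note that sub returns its input unchanged when the pattern does not match, so an href without a trailing id would come back whole rather than as NA; the ids are also character strings, not numbers.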
The following function wraps getKBBYearTrims to return a data.frame of trims for a set of model years.
getKBBTrims <- function(prefix, years, type = "private-party-value") {
    ## pass 'type' through so that a non-default value type is honored
    kbbTrimList <- lapply(years, function(year)
        getKBBYearTrims(prefix, year, type = type))
    kbbTrims <- do.call('rbind', kbbTrimList)
    return(kbbTrims)
}
Using it, we can try getting the trims for a series of model years:
> ## print trims and ids for years 2003 to 2007
> accord.trims <- getKBBTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", years = 2003:2007)
Loading http://www.kbb.com/used-cars/honda/accord/2003/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2003/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2004/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2004/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2006/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2006/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2007/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2007/private-party-value/equipment')]
> print(accord.trims)
year trim id
1 2003 Accord DX Sedan 4D 2488
2 2003 Accord EX Coupe 2D 2496
3 2003 Accord EX Sedan 4D 2498
4 2003 Accord EX-L Coupe 2D 263731
5 2003 Accord EX-L Sedan 4D 263730
6 2003 Accord LX Coupe 2D 2495
7 2003 Accord LX Sedan 4D 2492
8 2004 Accord DX Sedan 4D 2664
9 2004 Accord EX Coupe 2D 2671
10 2004 Accord EX Sedan 4D 2676
11 2004 Accord EX-L Coupe 2D 263735
12 2004 Accord EX-L Sedan 4D 263734
13 2004 Accord LX Coupe 2D 2669
14 2004 Accord LX Sedan 4D 2663
15 2005 Accord DX Sedan 4D 846
16 2005 Accord EX Coupe 2D 863
17 2005 Accord EX Sedan 4D 859
18 2005 Accord EX-L Coupe 2D 263736
19 2005 Accord EX-L Sedan 4D 263737
20 2005 Accord Hybrid Sedan 4D 868
21 2005 Accord LX Coupe 2D 856
22 2005 Accord LX Sedan 4D 850
23 2005 Accord LX Special Edition Coupe 2D 867
24 2006 Accord EX Coupe 2D 741
25 2006 Accord EX Sedan 4D 739
26 2006 Accord EX-L Coupe 2D 263727
27 2006 Accord EX-L Sedan 4D 263726
28 2006 Accord Hybrid Sedan 4D 744
29 2006 Accord LX Coupe 2D 736
30 2006 Accord LX Sedan 4D 734
31 2006 Accord SE Sedan 4D 738
32 2006 Accord VP Sedan 4D 737
33 2007 Accord EX Coupe 2D 83835
34 2007 Accord EX Sedan 4D 83834
35 2007 Accord EX-L Coupe 2D 263674
36 2007 Accord EX-L Sedan 4D 263675
37 2007 Accord Hybrid Sedan 4D 83836
38 2007 Accord LX Coupe 2D 83833
39 2007 Accord LX Sedan 4D 83829
40 2007 Accord SE Sedan 4D 83832
41 2007 Accord VP Sedan 4D 83827
Everything works great. What a shock.
How to buy a used car with R (part 1)
I’m in the process of buying a used car. Since I enjoy making these decisions as complicated as possible, I’ve written some R code to scrape relevant websites for informative data. I’ve written this up as a blog entry because I think it’s a decent example of how one might use the XML package and Firebug to quickly and easily bring data from websites into R.
Part 1: Scraping the surface of the Kelley Blue Book
In the past, the first resource a used car buyer looking for price information might have turned to was the Kelley Blue Book; now, this information is available for free at KBB.com:
Finding the data with Firebug
For now, I’m going to skip ahead to the page containing the kind of information that we want; later, I’ll back up and go through the process of getting to that page and detail how I wrote some simple functions automating queries for different parameters.
Here’s http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, giving the KBB private party value for a 2005 Honda Accord DX Sedan with automatic transmission, standard options, and 10,000 miles:

To get at the data we want, we need to identify where it is located in the structure of the page. While one can do this by simply reading the HTML source code, Firebug makes things much simpler. Load up Firebug and go to the HTML tab. Click the Inspect Element button (or go to the Firebug menu and choose Inspect Element); as you mouse-over elements on the page, you’ll notice that the corresponding tag in the HTML element tree is opened and highlighted. In the screenshot below, I’ve clicked on the value for the Excellent condition:
Examining the HTML tree in the Firebug display, we can see that all of the information we’re interested in is contained in a table with id ‘priceCondition’. Similarly, if you’re using Google Chrome, you can accomplish the same thing with the Developer Tools. Below, Firefox is on the left and Chrome is on the right:
Parsing the web with the XML package
The XML package includes a convenient function called readHTMLTable to grab the data from the table we identified earlier. We can simply give it the URL of the page and it returns a list containing each of the page’s tables as an R object (converting them to data.frame by default).
kbbURL <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000"

require(XML)
kbbTables <- readHTMLTable(kbbURL)
With this minimal amount of effort, we’re most of the way to what we’re after:
> print(kbbTables) $priceCondition Condition\r\n \r\n Value 12 Excellent $12,340 3 Good $11,665 4 Fair $10,565
By explicitly specifying the header, skipping the first two rows, and extracting the ‘priceCondition’ data.frame itself, we’re left with the raw data we are interested in:
kbbTable <- readHTMLTable(doc = kbbURL,
                          header = c("Condition","Value"),
                          skip.rows = c(1,2))[["priceCondition"]]
> print(kbbTable)
Condition Value
1 Excellent $12,340
2 Good $11,665
3 Fair $10,565
Now, if we take a look at the URL we’re using, http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, it should be apparent that fetching these values for any given mileage won’t be any trouble. The following code gets the KBB values for 10,000 mile increments from 10,000 to 150,000 miles:
kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="

kbbValuesList <- lapply(seq(10000,150000,by=10000), function(m) {
    readHTMLTable(doc = paste(kbbURLPrefix,m,sep=""),
                  header = c("Condition","Value"),
                  skip.rows = c(1,2))[["priceCondition"]]
})
> length(kbbValuesList)
[1] 15
> head(kbbValuesList,2)
[[1]]
Condition Value
1 Excellent $12,340
2 Good $11,665
3 Fair $10,565

[[2]]
Condition Value
1 Excellent $11,965
2 Good $11,290
3 Fair $10,190
Finally, we can convert the list into one big data.frame and augment it with the corresponding mileages and the model year. This leaves us with a nice data.frame from which we can extract whatever information we desire.
kbbValues <- do.call('rbind', kbbValuesList)
kbbValues$Mileage <- rep(seq(10000,150000,by=10000), each = 3)
kbbValues$Year <- 2005
> head(kbbValues)
Condition Value Mileage Year
1 Excellent $12,340 10000 2005
2 Good $11,665 10000 2005
3 Fair $10,565 10000 2005
4 Excellent $11,965 20000 2005
5 Good $11,290 20000 2005
6 Fair $10,190 20000 2005
> print(kbbValues[which(kbbValues$Condition == "Excellent"),c("Mileage","Value")])
Mileage Value
1 10000 $12,340
4 20000 $11,965
7 30000 $11,565
10 40000 $11,140
13 50000 $10,740
16 60000 $10,265
19 70000 $9,740
22 80000 $9,190
25 90000 $8,640
28 100000 $9,440
31 110000 $7,640
34 120000 $7,190
37 130000 $6,765
40 140000 $6,190
43 150000 $5,965
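The rep(..., each = 3) step above relies on every table having exactly three rows. As a hedged alternative, one can tag each table with its mileage before binding, which keeps working if a page ever returns a different number of rows. The sketch below uses two small hand-built tables (values copied from the output above) in place of the downloaded ones.

```r
## stand-ins for the first two downloaded tables
mileages <- c(10000, 20000)
valuesList <- list(
    data.frame(Condition = c("Excellent", "Good", "Fair"),
               Value = c("$12,340", "$11,665", "$10,565"),
               stringsAsFactors = FALSE),
    data.frame(Condition = c("Excellent", "Good", "Fair"),
               Value = c("$11,965", "$11,290", "$10,190"),
               stringsAsFactors = FALSE))

## attach the mileage to each table first, then stack them
tagged <- Map(function(df, m) { df$Mileage <- m; df }, valuesList, mileages)
values <- do.call(rbind, tagged)
values$Year <- 2005
print(values)
```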
Graphing our results with ggplot
Our last trick for the day is a simple one: take the data and make a pretty picture. Having collected the KBB values for different conditions and mileages, it is straightforward to construct a plot of value versus mileage for each condition.
First, however, we need to convert the kbbValues$Value column from its current human-readable state (a factor with levels like “$10,265”) into a more natural form for analysis. A quick bit of regular expressions magic using gsub does the trick, and we’re left with a nice column of numbers:
kbbValues$Value <- as.numeric(gsub("[$,]","",kbbValues$Value))
> kbbValues$Value
[1] 12340 11665 10565 11965 11290 10190 11565 10890 9790 11140 10465 9365
[13] 10740 10065 8965 10265 9590 8490 9740 9065 7965 9190 8515 7415
[25] 8640 7965 6865 9440 8765 7665 7640 6965 5865 7190 6515 5415
[37] 6765 6090 4990 6190 5515 4415 5965 5290 4190
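It is worth seeing why the gsub step matters when the column is a factor. In this small sketch on a three-element toy factor, calling as.numeric directly yields the internal level codes rather than the dollar amounts, whereas stripping the symbols first gives the intended numbers.

```r
## toy factor resembling the Value column
value <- factor(c("$12,340", "$9,740", "$10,265"))

## as.numeric() on a factor returns level codes, not the printed values
wrong <- as.numeric(value)

## strip "$" and "," from the labels, then convert
right <- as.numeric(gsub("[$,]", "", as.character(value)))

print(wrong)
print(right)
```

gsub coerces a factor to character on its own, so the explicit as.character here is only for clarity.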
Use of ggplot is a subject best left for another time. Here, it’s as simple as:
require(ggplot2)
ggplot(kbbValues, aes(x = Mileage, y = Value, color = Condition, group = Condition)) + geom_line()
This gives us the following beautiful plot:
Wait, what?
So, where did that peak at 100,000 miles come from?
Well, looking back, it’s clear that it’s present in the raw data in kbbValues. If we check the original page (http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=100000), however, the values don’t match. What happened?
The culprit, and a correction, are in the code below:
kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="

kbbValuesList <- lapply(seq(10000,150000,by=10000), function(m) {
    currentURL <- sprintf("%s%i",kbbURLPrefix,m)
    cat(currentURL,"\n") # print debug info so we catch these errors!
    readHTMLTable(doc = currentURL,
                  # The following converts m to character using as.character,
                  # but as.character(100000) returns "1e+05"
                  # doc = paste(kbbURLPrefix,m,sep=""),
                  header = c("Condition","Value"),
                  skip.rows = c(1,2))[["priceCondition"]]
})
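The difference between the two ways of building the URL can be seen without any downloads. This sketch builds only the "mileage=" suffix; the exact text paste produces for 100000 depends on R's number formatting (with default settings it is typically "1e+05"), while the integer formats shown are unaffected.

```r
mileages <- seq(10000, 150000, by = 10000)

## paste() converts numbers with as.character(), which may choose
## scientific notation when that representation is shorter
viaPaste <- paste("mileage=", mileages, sep = "")

## sprintf() with an integer format always writes out the digits
viaSprintf <- sprintf("mileage=%i", mileages)

## format(scientific = FALSE) is another way to force plain digits
viaFormat <- paste0("mileage=", format(mileages, scientific = FALSE, trim = TRUE))

## the tenth element is the 100,000-mile query
print(viaPaste[10])
print(viaSprintf[10])
```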
Using the corrected procedure, we are rewarded with a nice, smooth graph:
