
Scraping New York Times Articles with R

This is a very simple, quick-and-dirty attempt to make use of the NYT Article API from within R.
It retrieves the publication dates of articles that contain a query string and plots the number of articles over time, like this:

[Figure: number of NYT articles over time containing "Health Care Reform"]

The How-To:

There are currently a few limitations on the NYT Article API.

  1. Results are returned in JSON format only; XML may be available in the future. To handle this format we take advantage of the RJSONIO package from Omegahat, which needs to be installed from source (see the short parsing sketch after this list).
  2. A maximum of 10 records is returned per query.
  3. The number of queries is limited to 10/sec and 5,000/day per API key.
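
Point 1 is less painful than it sounds, because RJSONIO does the JSON-to-R translation for us. Here is a minimal sketch of what fromJSON() produces; the JSON string is a hand-written, simplified stand-in for an Article API response (the real response carries more fields):

  library(RJSONIO)
  # simplified stand-in for an API response
  sample <- '{"offset":"0","results":[{"date":"20091014"},{"date":"20091015"}],"total":2}'
  res <- fromJSON(sample)
  res$total             # 2
  unlist(res$results)   # named character vector of publication dates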

Running the code below requires (free) registration for an API key.

R:

  # Need to install from source: http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
  # then load:
  library(RJSONIO)

  ### set parameters ###
  api <- "XXXX"   ###### <<< API key goes here!!

  q <- "health+care+reform"   # query string, use + instead of space
  records <- 500              # total number of records to retrieve (multiple of 10)

  # calculate parameter for offset (the API serves 10 records per request)
  os <- 0:(records/10 - 1)

  # read first set of data in
  uri <- paste("http://api.nytimes.com/svc/search/v1/article?query=", q,
               "&fields=date&offset=", os[1], "&api-key=", api, sep = "")
  raw.data <- readLines(uri, warn = FALSE)
  res <- fromJSON(raw.data)
  dat <- unlist(res$results)

  # read in the rest via loop
  for (i in 2:length(os)) {
    # concatenate URL for each offset
    uri <- paste("http://api.nytimes.com/svc/search/v1/article?query=", q,
                 "&fields=date&offset=", os[i], "&api-key=", api, sep = "")
    raw.data <- readLines(uri, warn = FALSE)
    res <- fromJSON(raw.data)
    dat <- append(dat, unlist(res$results))
  }

  # aggregate counts for dates and coerce into a data frame
  cts <- as.data.frame(table(dat))

  # establish date range
  dat.conv <- as.POSIXct(strptime(dat, format = "%Y%m%d"))
  daterange <- range(dat.conv)
  dat.all <- seq(daterange[1], daterange[2], by = "day")

  # compare dates from the counts data frame with the whole date range;
  # assign 0 where there is no count, otherwise take the count
  # (can't seem to compare POSIX objects with %in%, so coerce to character for this)
  freqs <- rep(0, length(dat.all))
  idx <- match(as.character(cts$dat), format(dat.all, "%Y%m%d"))
  freqs[idx] <- cts$Freq

  plot(freqs, type = "l", xaxt = "n",
       main = paste("Search term(s):", q), ylab = "# of articles", xlab = "date")
  axis(1, 1:length(freqs), format(dat.all, "%Y-%m-%d"))
  lines(lowess(freqs, f = .2), col = 2)
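
A possible refinement, sketched under the assumption that the response reports the total number of matches in a field called total (as the v1 Article API does): read that value from the first response and only loop over as many offsets as actually exist, rather than hard-coding records.

  # right after the first request above:
  total <- as.numeric(res$total)                      # total matching articles
  os <- 0:(ceiling(min(total, records)/10) - 1)       # don't request pages past the last match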


Categories: Noteworthy Bits.



12 Responses

  1. Terrific tutorial, but a few comments/questions:
    Would you provide the source as a downloadable link? When copying/pasting from your example, the line numbers were included. There are also better blog plugins for displaying source code.

    To get your example to run, I had to remove "amp;" from line 23. That wasn't present on line 15.

    Is request throttling handled automatically by NYT? I ask because I thought I read somewhere that NYT submissions are limited to 2/sec.

  2. Thanks for letting me know about that typo on line 23 -- fixed now.

    Right now the way it works is not very intuitive, but if you switch to PLAIN TEXT (top of the box where the code is displayed) you get the original code without the line numbers. (Yes, I need to find some time to get a better plugin for displaying code and fix up a few other things also.)

    I am afraid I don't know any details about their request throttling.

Ricardo Pietrobon suggested trying the above on PubMed. Here is roughly how it might go.

PubMed has an API, provided by NCBI. Details are here. So all we'd need to do is concatenate the appropriate URL string, then parse the results in R. Results are in XML, so we'd need the XML package.

    To map keywords by publication date (like in the NYT example) we need to submit two different queries:

    1. For the first query we use the esearch utility to retrieve two important variables, WebEnv and QueryKey.
    For the search term "h1n1" the URL needs to look like:
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=0&term=h1n1&usehistory=y

    # Read in:
    raw.data <- readLines("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=0&term=h1n1&usehistory=y")

    # Parse the result for WebEnv and QueryKey:
    library(XML)
    parsed.data <- xmlTreeParse(raw.data, asText = TRUE)
    xmlValue(xmlChildren(xmlRoot(parsed.data))$QueryKey)
    xmlValue(xmlChildren(xmlRoot(parsed.data))$WebEnv)

    2. For the second query we use the esummary utility to retrieve the actual records. Using the WebEnv and QueryKey values from the previous query, the URL in my example looks like:
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=xml&query_key=1&WebEnv=NCID_1_51824642_130.14.18.46_9001_1255542890

    Now parse the result for PubDate.
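
    Roughly like this (an untested sketch; the WebEnv value is a placeholder for the one esearch actually returned):

    library(XML)
    uri <- paste("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?",
                 "db=pubmed&retmode=xml&query_key=1&WebEnv=XXXX", sep = "")
    doc <- xmlTreeParse(uri, useInternalNodes = TRUE)
    # PubDate arrives as <Item Name="PubDate" Type="Date">2009 Oct</Item>
    pubdates <- xpathSApply(doc, "//Item[@Name='PubDate']", xmlValue)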

    For an example in PHP, see here.

    I will make this a separate post once I get time to write out the details. Or maybe someone else will.

  4. Hi Claudia,

    Thank you for the interesting code. I tried unpacking the RJSONIO file folder into ..\R\win-library\2.10, but I get the following error:

    Loading required package: RJSONIO
    Error in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
      'RJSONIO' is not a valid installed package

    Any thoughts on why it's not working? The code works using the rjson package. What functionality is lost by not using RJSONIO? Also, how can we change the chart's axis tick marks to highlight the dates returned by the API?

    Thank you!!

    Matt

    Matthew Bascom, October 29, 2009 @ 5:24 pm
    • Matt,

      I don't have much experience on the Windows OS -- *maybe* RJSONIO is not compiled for 2.10 yet?? But this is really just a wild guess.

      See here for the difference between rjson and RJSONIO.

      If you want to play with the tick mark layout you could take a look at axis.
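
      For example (a rough sketch reusing freqs and dat.all from the post), to label only every 7th day:

      ticks <- seq(1, length(freqs), by = 7)
      axis(1, at = ticks, labels = format(dat.all[ticks], "%b %d"), las = 2)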

      Claudia

  5. Duncan Temple Lang recently released a 0.1 version of a package called RNYTimes. It still lacks much documentation and I have not tried it, but it may make things easier.

  6. Thank you, Claudia! I'll take a look at axis and the new RNYTimes package.

    Best regards,

    Matt Bascom

    Matthew Bascom, November 2, 2009 @ 3:20 pm
  7. RJSONIO is also available on CRAN, so downloading and installing it is now easier. Just type install.packages("RJSONIO") and R will find the package on the Web, download it to the right location, and install it.

    Montana, June 22, 2011 @ 6:14 am
  8. Dear Claudia,
    Thank you so much for sharing this work. I stumbled into the throttling issue Neil asked about and found the answer on Stack Overflow: the NYTimes does not throttle requests for you. You have to add a pause or you will get a polite email telling you you've exceeded your QPS.
    http://stackoverflow.com/questions/1174799/how-to-make-execution-pause-sleep-wait-for-x-seconds-in-r

    PAUSE <- function(x) {
      p1 <- proc.time()
      Sys.sleep(x)
      proc.time() - p1   # the CPU usage should be negligible
    }
    PAUSE(1)

    (A pause of .1 will also work in your concatenation loop, for 10 queries per second; see the sketch below.)
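
    Dropped into the retrieval loop from the post, that would look roughly like this (sketch):

    for (i in 2:length(os)) {
      Sys.sleep(.1)   # stay under the queries-per-second limit
      uri <- paste("http://api.nytimes.com/svc/search/v1/article?query=", q,
                   "&fields=date&offset=", os[i], "&api-key=", api, sep = "")
      raw.data <- readLines(uri, warn = FALSE)
      res <- fromJSON(raw.data)
      dat <- append(dat, unlist(res$results))
    }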
    Thank you again. I really admire your work.

    Sincerely,
    -Liam Honecker




Continuing the Discussion

  1. [...] have read the blog post about Scraping New York Times Articles with R. It’s great. I want to reproduce the work with python. First, we should learn about nytimes [...]


  2. [...] friend sent me along this great tutorial on webscraping the NYTimes with R. I would really love to try it. However, the first step is to install a package called RJSONIO [...]