We will work on Stanford's servers
Open SSH (or PuTTY) and connect to corn.stanford.edu
ssh sunetid@corn.stanford.edu
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP
Useful for simple downloads
and for recursive downloads (getting an entire website)
wget http://www.google.com
‘-r’
Turn on recursive retrieving.
‘-l depth’
Maximum recursion depth (how many levels of links to follow)
wget -r -l 1 http://www.google.com
This downloads the google.com page plus every file it links to directly (depth 1)
‘-m’
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps ftp directory listings.
cURL is used to interactively request information from servers
Send variables to request specific data
Send cookies (authentication, etc.)
'Fake' form data
curl http://google.com
Basic request
curl -o example.html http://google.com
Write output to a file instead of stdout
curl -O http://www.stanford.edu/index.html
Save a remote file and preserve the remote name
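The difference between -o and -O comes down to where the file name comes from. A minimal Python sketch of that choice follows; the helper names remote_name and download are illustrative, and the index.html fallback is an assumption of this sketch rather than documented cURL behaviour.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse


def remote_name(url, default="index.html"):
    """Return the last path segment of the URL, the name that -O would keep."""
    name = posixpath.basename(urlparse(url).path)
    # Assumption of this sketch: use a default when the path has no file name.
    return name or default


def download(url, filename=None):
    """Fetch url and save it under filename (like -o) or the remote name (like -O)."""
    target = filename or remote_name(url)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


print(remote_name("http://www.stanford.edu/index.html"))  # index.html
```

Calling download("http://www.stanford.edu/index.html") would need network access; only the name logic runs here.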
Two general methods for sending data to a server over HTTP: GET and POST
GET: embed the information needed to respond to the request in the URL
https://www.google.com/search?q=iriss+stanford
https://www.google.com/search?q=iriss+stanford&hl=de
Order doesn't matter, but the first variable must be preceded by a '?' and all others by an '&'
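The query-string rule above can be checked with a short Python sketch. It uses only the standard library and builds the same Google URL shown above; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://www.google.com/search"
params = {"q": "iriss stanford", "hl": "de"}

# urlencode joins the variables with '&' and turns spaces into '+'.
url = base + "?" + urlencode(params)
print(url)  # https://www.google.com/search?q=iriss+stanford&hl=de

# Reading the variables back out of the URL.
variables = parse_qs(urlsplit(url).query)
print(variables)  # {'q': ['iriss stanford'], 'hl': ['de']}
```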
POST: can represent any kind of data of any length
Data are encoded in much the same way as GET data, but travel in the request body instead of the URL
A form collecting your "Name" and "Location" would encode to:
Name=Sean+Westwood&Location=Palo+Alto
example: http://www.htmlcodetutorial.com/forms/_FORM_METHOD_POST.html
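A small Python sketch of the form encoding above. The endpoint http://example.com/form is a hypothetical placeholder, and the request is only constructed, never sent.

```python
import urllib.request
from urllib.parse import urlencode

form = {"Name": "Sean Westwood", "Location": "Palo Alto"}
body = urlencode(form)
print(body)  # Name=Sean+Westwood&Location=Palo+Alto

# Supplying a body makes urllib treat the request as a POST.
# Hypothetical endpoint, used for illustration only.
request = urllib.request.Request("http://example.com/form", data=body.encode("ascii"))
print(request.get_method())  # POST
```

Passing the request to urllib.request.urlopen would actually submit the form.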
HTTP headers contain information about all HTTP requests and responses. They include:
Request headers
Response headers
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Cookie: RMID=19de3a0356b15005c00dbe25;...
Host: www.nytimes.com
Referer: http://www.nytimes.com/
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.8 (KHTML, like Gecko) Chrome/23.0.1255.0 Safari/537.8
HTTP/1.1 200 OK
Date: Wed, 19 Sep 2012 21:30:14 GMT
Server: Apache
expires: Thu, 01 Dec 1994 16:00:00 GMT
cache-control: no-cache
pragma: no-cache
Set-cookie: adxcl=l*2ea2d=5107574f:1|li=5107574f:1; expires=Thursday, 19-Sep-2013 21:30:14 GMT; path=/; domain=.nytimes.com
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Transfer-Encoding: chunked
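Response headers are plain "Name: value" lines after a status line, so they are easy to take apart. The sketch below parses a shortened copy of the response above with Python's standard email parser, which is one reasonable way to do it rather than the only one.

```python
from email.parser import Parser

# Shortened copy of the response headers shown above.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Wed, 19 Sep 2012 21:30:14 GMT\r\n"
    "Server: Apache\r\n"
    "cache-control: no-cache\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Encoding: gzip\r\n"
)

# The first line is the status line; everything after it is headers.
status_line, header_text = raw.split("\r\n", 1)
version, status_code, reason = status_line.split(" ", 2)
headers = Parser().parsestr(header_text)

print(status_code, reason)            # 200 OK
print(headers["Cache-Control"])       # no-cache (header names are case-insensitive)
print(headers.get_content_type())     # text/html
print(headers.get_content_charset())  # utf-8
```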
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>An example page</title>
</head>
<body>
<h1>Heading text</h1>
<p>A paragraph</p>
</body>
</html>
Node structure
Clear hierarchy of nodes
XHTML is XML, but HTML is not XML.
Methods for working with XML such as XPATH will not always work with HTML. This largely depends on the language and the implementation of XPATH.
Malformed markup (unclosed or badly nested tags) will often break XPATH and other tools.
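A quick way to see the XML/HTML difference is to feed both to a strict XML parser and to a tolerant HTML parser. The Python sketch below uses two one-line snippets invented for illustration; the helper is_well_formed_xml and the class TagLister are names made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagLister(HTMLParser):
    """Tolerant HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


xhtml = "<p>First line<br/>second line</p>"  # every tag is closed
html = "<p>First line<br>second line</p>"    # <br> is never closed

print(is_well_formed_xml(xhtml))  # True
print(is_well_formed_xml(html))   # False

lister = TagLister()
lister.feed(html)
print(lister.tags)  # ['p', 'br']
```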
<body>
<h1 id="introHeader">Heading text</h1>
<p class="articleText">A paragraph</p>
<p class="articleText">A second paragraph</p>
</body>
XPath is used to navigate through elements and attributes in an XML/HTML document.
Select the first paragraph
/html/body/p[1]
Select all paragraphs in body
/html/body/p
Select just the text in the paragraphs
/html/body/p/text()
<body>
<h1 id="introHeader">Heading text</h1>
<p class="firstParagraph">A paragraph</p>
<p class="articleText">A second paragraph</p>
</body>
Conditional selections (only the paragraph with class "firstParagraph")
//p[@class='firstParagraph']
"/" Selects from the root node
"//" Selects nodes in the document from the current node that match the selection no matter where they are
"@" Selects attributes
Try the previous XPATH examples with http://www.bit-101.com/xpath/ and the following markup:
<html>
<head>
<title>An example page</title>
</head>
<body>
<h1 id="introHeader">Heading text</h1>
<p class="firstParagraph">A paragraph</p>
<p class="articleText">A second paragraph</p>
</body>
</html>
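The same selections can also be tried offline. The sketch below uses Python's xml.etree.ElementTree, which implements only a subset of XPath: paths are relative to the parsed root element, and there is no text() step, so text is read from the .text attribute instead. Treat it as an approximation of the expressions above, not a full XPath engine.

```python
import xml.etree.ElementTree as ET

markup = """
<html>
  <head><title>An example page</title></head>
  <body>
    <h1 id="introHeader">Heading text</h1>
    <p class="firstParagraph">A paragraph</p>
    <p class="articleText">A second paragraph</p>
  </body>
</html>
"""
root = ET.fromstring(markup)  # root is the <html> element

# /html/body/p[1] : the first paragraph
first = root.find("./body/p[1]")

# /html/body/p : all paragraphs in body
paragraphs = root.findall("./body/p")

# /html/body/p/text() : read .text because text() is not supported here
texts = [p.text for p in paragraphs]

# //p[@class='firstParagraph'] : conditional selection, anywhere in the tree
conditional = root.findall(".//p[@class='firstParagraph']")

print(first.text)                        # A paragraph
print(texts)                             # ['A paragraph', 'A second paragraph']
print([p.text for p in conditional])     # ['A paragraph']
```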
Using the following markup
<html>
<head>
<title>An example page</title>
</head>
<body>
<div id="main">
<h1 id="introHeader">Heading text</h1>
<p class="firstParagraph">A paragraph</p>
<p class="articleText">A second paragraph with a
<a href="http://www.google.com">link to google</a>.</p>
</div>
</body>
</html>
for help see: http://www.w3schools.com/xpath/xpath_syntax.asp
//div[@id="main"]
//div[@id="main"]/h1/text()
//a/@href
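One way to check what these three expressions return is to run close equivalents against the markup above. This Python sketch uses xml.etree.ElementTree, which cannot select text() or @href directly, so the last step of each expression is done in Python.

```python
import xml.etree.ElementTree as ET

markup = """
<html>
  <head><title>An example page</title></head>
  <body>
    <div id="main">
      <h1 id="introHeader">Heading text</h1>
      <p class="firstParagraph">A paragraph</p>
      <p class="articleText">A second paragraph with a
      <a href="http://www.google.com">link to google</a>.</p>
    </div>
  </body>
</html>
"""
root = ET.fromstring(markup)

# //div[@id="main"] : the div element itself
main = root.find(".//div[@id='main']")

# //div[@id="main"]/h1/text() : text of the heading inside that div
heading = main.find("h1").text

# //a/@href : the href attribute of every link
hrefs = [a.get("href") for a in root.findall(".//a")]

print(main.tag, [child.tag for child in main])  # div ['h1', 'p', 'p']
print(heading)                                   # Heading text
print(hrefs)                                     # ['http://www.google.com']
```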
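<?php
// Fetch the page with cURL.
$curl = curl_init('http://www.stanford.edu');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10'); // identify as a desktop browser
$html = curl_exec($curl);
curl_close($curl);
// Parse the HTML; the @ suppresses the warnings that malformed markup triggers.
$dom = new DOMDocument();
@$dom->loadHTML($html);
// Select every span with class "event_title" and print its text.
$xpath = new DOMXPath($dom);
$result = $xpath->query('//span[@class="event_title"]');
foreach ($result as $linkText) {
    echo $linkText->nodeValue."<br>";
}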
php -f script.php
To create a file use nano (or vim)
nano filename.php
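The parse-and-select half of the script can be practised without a network connection. The sketch below does it in Python with the standard html.parser module on an invented HTML string; ClassTextCollector is a name made up for this example. Unlike @class="event_title", it matches the class as one token among several, which is an assumption about what is usually wanted.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> element that carries class_name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we stop at the right end tag.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Invented sample page standing in for the downloaded HTML.
sample = (
    '<div><span class="event_title">Talk A</span>'
    "<span>skip me</span>"
    '<span class="event_title big">Talk <b>B</b></span></div>'
)

collector = ClassTextCollector("span", "event_title")
collector.feed(sample)
print(collector.results)  # ['Talk A', 'Talk B']
```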
Using the code on the previous slide as a starting point, capture the headlines from http://www.nytimes.com/pages/politics/index.html
Bonus if you can get all the articles, including the featured article
<?php
$curl = curl_init('http://www.nytimes.com/pages/politics/index.html');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@class="storyHeader"]/h1/a/text()');
foreach ($result as $linkText) {
echo $linkText->nodeValue."<br>";
}
$result = $xpath->query('//div[@class="story"]/h3/a/text()');
foreach ($result as $linkText) {
echo $linkText->nodeValue."<br>";
}
Using the code from the last example, use XPATH and cURL to download all the articles from the politics section of the New York Times
Name each file as n.html, where n is the current index of the article in the list of articles
<?php
$curl = curl_init('http://www.nytimes.com/pages/politics/index.html');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@class="story"]/h3/a/@href');
$i=0;
foreach ($result as $linkText) {
$curl = curl_init($linkText->nodeValue);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
file_put_contents($i.".html", $html); // save as 0.html, 1.html, ...
$i++; // advance the index so each article gets its own file
}
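The naming step is easy to get wrong: the index has to advance on every pass and the extension needs its dot. A minimal Python sketch of just that step, using placeholder strings in place of downloaded articles and a hypothetical helper named save_pages:

```python
import tempfile
from pathlib import Path


def save_pages(pages, directory):
    """Write each page body to <index>.html and return the paths written."""
    directory = Path(directory)
    paths = []
    for i, html in enumerate(pages):  # enumerate advances the index for us
        path = directory / f"{i}.html"
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder bodies standing in for downloaded articles.
with tempfile.TemporaryDirectory() as tmp:
    written = save_pages(["<p>first</p>", "<p>second</p>"], tmp)
    print([p.name for p in written])  # ['0.html', '1.html']
```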
<body> <div id="main">
<h1 id="introHeader">Heading text</h1>
<p class="firstParagraph">A paragraph</p>
<p class="articleText">A second paragraph with a <a href="http://www.google.com">link to google</a>.</p></div>
</body>
A class selector is a name preceded by a period (.) and an ID selector is a name preceded by a hash character (#).
An ID represents ONE element, whereas a class can represent any number of elements
#introHeader selects the h1 element
.articleText selects every p element with the class articleText
Additional information such as element names or additional classes come after the initial selector
To select the link in the second paragraph:
.articleText a
Generally select a single node or a list of nodes
To access attributes (e.g., the href of an 'a' tag) or content, you must then select attributes of the node
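To make the selector rules concrete, the sketch below translates simple selectors (#id, .class, tag name, and descendants separated by spaces) into ElementTree paths. Python's standard library has no CSS selector engine, so to_path and select are hypothetical helpers written for this example, and the class test is an exact match on the whole class attribute.

```python
import xml.etree.ElementTree as ET


def to_path(selector):
    """Translate a simple CSS selector into an ElementTree path."""
    steps = []
    for token in selector.split():
        if token.startswith("#"):
            steps.append(f"*[@id='{token[1:]}']")
        elif token.startswith("."):
            steps.append(f"*[@class='{token[1:]}']")
        else:
            steps.append(token)
    # A space in CSS means "any descendant", which is // in a path.
    return ".//" + "//".join(steps)


def select(root, selector):
    """Return the list of nodes matching the selector."""
    return root.findall(to_path(selector))


markup = """<body><div id="main">
<h1 id="introHeader">Heading text</h1>
<p class="firstParagraph">A paragraph</p>
<p class="articleText">A second paragraph with a <a href="http://www.google.com">link to google</a>.</p>
</div></body>"""
root = ET.fromstring(markup)

print(select(root, "#introHeader")[0].tag)          # h1
print(len(select(root, ".articleText")))            # 1
# Selecting gives a node; the attribute is read from it in a second step.
print(select(root, ".articleText a")[0].get("href"))  # http://www.google.com
```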
Java
String URL = "http://www.spiegel.de/spiegel/print/index-" + year + "-" + issue + ".html";
Document document = Jsoup.connect(URL).timeout(12000).get();
Elements links = document.select("#spHeftInhalt a");
Integer articleNumber = 0;
for (Element link : links) {
String linkHref = "http://www.spiegel.de" + link.attr("href");
processFile(linkHref, year, issue, articleNumber);
articleNumber++;
}
Java class (also an example of how to create an XML document with Java)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.nytimes.com/nyt/rss/Politics");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$newsScrape = curl_exec($ch);
$docNews = new DOMDocument();
$docNews->loadXML($newsScrape);
$nodesArticles = $docNews->getElementsByTagName("item");
foreach($nodesArticles as $nodesArticle){
$link = $nodesArticle->getElementsByTagName("link")->item(0)->nodeValue;
}
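The same item/link walk can be tried on a feed held in a string. In this Python sketch the feed content is invented for illustration (it is not the real New York Times feed), and only the parsing step is shown.

```python
import xml.etree.ElementTree as ET

# Invented RSS sample; a real feed would be downloaded first.
feed = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second.html</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)

# Visit every <item> and take the text of its <link> child.
# The channel's own <link> is skipped because it is not inside an item.
links = [item.findtext("link") for item in root.iter("item")]
print(links)  # ['http://example.com/first.html', 'http://example.com/second.html']
```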
http://boilerpipe-web.appspot.com/
Java
URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);
XML
<?xml version="1.0" encoding="UTF-8"?>
<articles>
<article>
<source>The Political Methodologist</source>
<title> Data from the Web into R</title>
<author> Simon Jackman</author>
<date> Fall 2006</date>
</article>
</articles>
JSON
[{
"article":
[{
"source":"The Political Methodologist",
"title":" Data from the Web into R",
"author":" Simon Jackman",
"date":" Fall 2006"
}]
}]
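The two representations carry the same record, which a short conversion makes visible. This Python sketch parses the XML version and emits the JSON layout shown above; it assumes every child of an article is a simple text element, which holds for this record but not for XML in general.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """<articles>
  <article>
    <source>The Political Methodologist</source>
    <title>Data from the Web into R</title>
    <author>Simon Jackman</author>
    <date>Fall 2006</date>
  </article>
</articles>"""

root = ET.fromstring(xml_text)

# One dictionary per <article>: tag name becomes the key, text becomes the value.
records = [
    {child.tag: child.text for child in article}
    for article in root.findall("article")
]

as_json = json.dumps([{"article": records}], indent=2)
print(as_json)
```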
Convert the following CSV data to XML and JSON
| Email | Name | Department |
| aarefeva@stanford.edu | Arefeva, Alina | Econ |
| gallego@stanford.edu | Gallego, Aina | Poli Sci |
| kgleich@gmail.com | Gleichauf, Karla | Eng |
| akarama1@stanford.edu | Karamalla, Ayman | MS&E |
http://shancarter.com/data_converter/ converts CSV or tab data to XML and JSON
Programmatic conversion from JSON and XML to CSV requires custom code for each document/schema, because nested data has to be flattened into rows
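Going from CSV to JSON or XML is the easy direction, because every row has the same flat shape. A minimal Python sketch on two of the rows above; the header name Email and the element names people and person are assumptions made for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Names contain commas, so they are quoted in the CSV.
csv_text = """Email,Name,Department
gallego@stanford.edu,"Gallego, Aina",Poli Sci
akarama1@stanford.edu,"Karamalla, Ayman",MS&E
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: a list with one object per row.
as_json = json.dumps(rows, indent=2)

# XML: one <person> per row, one child element per column.
people = ET.Element("people")
for row in rows:
    person = ET.SubElement(people, "person")
    for field, value in row.items():
        ET.SubElement(person, field.lower()).text = value
as_xml = ET.tostring(people, encoding="unicode")

print(as_json)
print(as_xml)  # note that MS&E is escaped as MS&amp;E
```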
Convert the following XML to a flat file
<items>
<item id="0001" type="donut">
<name>Cake</name>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
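There is more than one valid flat layout for this document. The Python sketch below shows one possibility: one row per item, with repeated elements joined by semicolons. The column names and the separator are choices made for this example; one row per batter/topping pair would be an equally defensible answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """<items>
  <item id="0001" type="donut">
    <name>Cake</name>
    <batters>
      <batter>Regular</batter>
      <batter>Chocolate</batter>
      <batter>Blueberry</batter>
    </batters>
    <topping>None</topping>
    <topping>Glazed</topping>
    <topping>Sugar</topping>
    <topping>Sprinkles</topping>
    <topping>Chocolate</topping>
    <topping>Maple</topping>
  </item>
</items>"""


def flatten(item):
    """Turn one <item> into a flat row; repeated elements are joined with ';'."""
    return {
        "id": item.get("id"),
        "type": item.get("type"),
        "name": item.findtext("name"),
        "batters": ";".join(b.text for b in item.findall("batters/batter")),
        "toppings": ";".join(t.text for t in item.findall("topping")),
    }


rows = [flatten(item) for item in ET.fromstring(xml_text).findall("item")]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "type", "name", "batters", "toppings"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```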