CS 124 / LINGUIST 180 - Winter 2009
Homework 6: Relation Extraction and XML
Due: Feb 26 before the start of class
This homework has two parts, both making use of Wikipedia.
The first 100 MB of Wikipedia in XML format is here:
/usr/class/cs124/enwiki-20081008-pages-articles-first-100mb.xml
Obviously, this file is very large, and you may not want to download it. In particular, you don't want to submit it, or you'll very likely run us out of quota. (In fact, opening this in an editor may cause problems. We suggest looking at it in less instead of emacs or vi.)
Part 1 is to take this Wikipedia XML file and produce a summary of the article titles, sorted by last-modified timestamp. It will probably be easiest to use XMLStarlet.
XMLStarlet is a command-line utility for processing XML files. We already sent out an email about how to configure your environment to use it. To use it, edit your ~/.cshrc and add /usr/class/cs124/bin to the path so that the line looks like:
set path=( ~/bin $site_path /usr/class/cs124/bin/ )
This should be around line 41. Log out and log back in. You must be on the myths, not the pods or anything else. Then execute "xml" to use XMLStarlet. You'll probably be most interested in "xml tr", which transforms a document using an XSL style sheet. The syntax is:
xml tr [xsl file] [xml file] > [output file]
You should write an XSL file that prints an HTML table with the title of each article along with its "revision/timestamp". We also want you to sort the rows by timestamp. (Big hint: search for xsl:sort.)
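To see how xsl:sort behaves before pointing it at the real dump, here is a minimal sketch that runs a toy stylesheet through Java's built-in javax.xml.transform API instead of XMLStarlet. The `<items>`, `<name>`, and `<when>` elements and the class name `XslSortSketch` are made up for illustration; this is not the Wikipedia schema and not the expected solution. It also assumes the timestamps are ISO-style strings, which order correctly under xsl:sort's default text comparison; check that against the data.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XslSortSketch {
    // Toy stylesheet: one table row per <item>, ordered by its <when> child.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='/items'>"
      + "<table>"
      + "<xsl:for-each select='item'>"
      + "<xsl:sort select='when'/>"
      + "<tr><td><xsl:value-of select='name'/></td>"
      + "<td><xsl:value-of select='when'/></td></tr>"
      + "</xsl:for-each>"
      + "</table>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Toy input (hypothetical element names), deliberately out of order.
    static final String XML =
        "<items>"
      + "<item><name>B</name><when>2008-10-02T00:00:00Z</when></item>"
      + "<item><name>A</name><when>2008-10-01T00:00:00Z</when></item>"
      + "</items>";

    /** Applies the stylesheet to the document and returns the result as a string. */
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The row for A (earlier timestamp) should come out before the row for B.
        System.out.println(transform(XSL, XML));
    }
}
```

The same stylesheet text, saved to a file, should also work with the "xml tr" syntax shown above.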
Your job in Part 2 is to do relation extraction. You will be extracting two kinds of relations: the hypernym/hyponym relation, and the "spouse-of" relation, from Wikipedia.
For the first relation, your job is to give us a list of every kind of fish (all hyponyms of "fish") you can find in this data. For your second relation, you must give us all pairs of spouses you can find in this data. (This can be current spouses, or past or historical spouses.)
In both cases, you must use both of the following two methods: structured Infobox fields and textual patterns.
Start by taking a look at the data. For example, search for Largemouth bass, and notice that it appears in a structured table called an Infobox. The first method is to use the Infobox information, for example:
{{Infobox U.S. state symbols
|Name = Alabama
|Amphibian = [[Red Hills salamander]]
|Bird = [[Yellowhammer]], [[Wild Turkey]]
|Butterfly = [[Eastern Tiger Swallowtail]]
|Fish = [[Largemouth bass]], [[Atlantic tarpon|Fighting tarpon]]
...
You might use this Infobox from the article on Alabama (David's home state!) to learn that Largemouth bass, Atlantic tarpon, and Fighting tarpon are all kinds of fish.
Similarly, you might notice that the Infobox for Arnold Schwarzenegger tells you that he is married to Maria Shriver:
{{Infobox Governor
|name = Arnold Schwarzenegger
...
|spouse = {{nowrap|[[Maria Shriver]] (1986–present)}}
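One way to pull names out of an Infobox field is a pair of regular expressions: one to find the "|field = ..." line and one to collect the [[wiki links]] on it. The sketch below assumes each field sits on its own line, as in the Alabama example; the class and method names are hypothetical. For a piped link such as [[Atlantic tarpon|Fighting tarpon]] it keeps both the target and the label, matching the reading above that both are kinds of fish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InfoboxFieldSketch {
    // Matches [[Target]] or [[Target|Label]]; group 1 is the target, group 2 the optional label.
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    /** Returns link targets and labels found on every "|field = ..." line of the wikitext. */
    static List<String> linkedNames(String wikitext, String field) {
        List<String> names = new ArrayList<>();
        // Multiline, case-insensitive: a line starting with "|", the field name, then "=".
        Pattern line = Pattern.compile(
            "(?mi)^\\s*\\|\\s*" + Pattern.quote(field) + "\\s*=(.*)$");
        Matcher m = line.matcher(wikitext);
        while (m.find()) {
            Matcher link = LINK.matcher(m.group(1));
            while (link.find()) {
                names.add(link.group(1).trim());
                if (link.group(2) != null) {
                    names.add(link.group(2).trim());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String infobox = String.join("\n",
            "{{Infobox U.S. state symbols",
            "|Name = Alabama",
            "|Fish = [[Largemouth bass]], [[Atlantic tarpon|Fighting tarpon]]");
        // prints [Largemouth bass, Atlantic tarpon, Fighting tarpon]
        System.out.println(linkedNames(infobox, "Fish"));
    }
}
```

Calling linkedNames with the field "spouse" on the Schwarzenegger Infobox would return Maria Shriver the same way, since the link is found even inside the {{nowrap|...}} template. Real Infoboxes vary a lot, so expect to extend this.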
The second method is to use patterns like those of Hearst (1992) described in lecture to find more examples of fish that appear only in the plain text. For example, the pattern "fish such as X" could help you learn that Chinese Paddlefish and Yangtze Sturgeon are both kinds of fish:
The river is also home to rare fish such as the [[Chinese Paddlefish]] and [[Yangtze Sturgeon]], which may also already be extinct.
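A rough sketch of the "fish such as X" pattern in Java follows. It assumes the hyponyms are written as wiki links and reads a list of links joined by commas, "and", or "or" right after the trigger phrase, stopping at the first thing that is not part of such a list. The class and method names are hypothetical, and this is only one of many ways to write the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HearstFishSketch {
    // The trigger phrase of the pattern.
    private static final Pattern SUCH_AS = Pattern.compile("(?i)\\bfish such as\\b");

    // One list item: optional comma, optional "and"/"or", optional "the", then a wiki link.
    // Group 1 is the link target; a "|label" part, if present, is skipped.
    private static final Pattern ITEM = Pattern.compile(
        "\\s*(?:,\\s*)?(?:(?:and|or)\\s+)?(?:the\\s+)?\\[\\[([^\\]|]+)(?:\\|[^\\]]+)?\\]\\]");

    /** Returns the link targets listed directly after each "fish such as". */
    static List<String> hyponyms(String text) {
        List<String> found = new ArrayList<>();
        Matcher head = SUCH_AS.matcher(text);
        while (head.find()) {
            Matcher item = ITEM.matcher(text);
            item.region(head.end(), text.length());
            // Consume consecutive list items; stop when the list ends.
            while (item.lookingAt()) {
                found.add(item.group(1).trim());
                item.region(item.end(), text.length());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "The river is also home to rare fish such as the "
            + "[[Chinese Paddlefish]] and [[Yangtze Sturgeon]], "
            + "which may also already be extinct.";
        // prints [Chinese Paddlefish, Yangtze Sturgeon]
        System.out.println(hyponyms(sentence));
    }
}
```

Stopping at the end of the list keeps links in a trailing clause from being picked up as fish, at the cost of missing unlinked names.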
Similarly, the pattern "X is married to Y" could help you extract the spouse relation between Ingrid Selberg and Mustapha Matura from this sentence:
Ingrid Selberg is married to playwright [[Mustapha Matura]].
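The "X is married to Y" pattern could be sketched along these lines. The assumptions here are that X is a run of capitalized words in plain text, that Y is a wiki link, and that up to three lowercase words (such as "playwright") may sit between "married to" and the link. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarriedToSketch {
    // Group 1: capitalized words before "is married to"; group 2: the linked spouse.
    private static final Pattern MARRIED = Pattern.compile(
        "((?:\\p{Lu}[\\p{L}.'-]*\\s+)+)is married to\\s+(?:\\p{Ll}+\\s+){0,3}"
      + "\\[\\[([^\\]|]+)(?:\\|[^\\]]+)?\\]\\]");

    /** Returns {X, Y} pairs for every "X is married to ... [[Y]]" match in the text. */
    static List<String[]> spousePairs(String text) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = MARRIED.matcher(text);
        while (m.find()) {
            pairs.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sentence = "Ingrid Selberg is married to playwright [[Mustapha Matura]].";
        for (String[] pair : spousePairs(sentence)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

A pattern this simple will also match sentences such as "She is married to [[...]]", where X is a pronoun; in that case you would need some other way to recover the name, for instance falling back on the article title.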
For each relation you extract, we also want you to tell us the title of the page you extracted it from.
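To keep track of the page title while mining the article text, one option is to stream the file rather than load it whole. The sketch below uses Java's StAX reader (javax.xml.stream) and assumes each page has a `<title>` element and a `<text>` element inside its revision, which is the usual MediaWiki export layout; verify this against the file. The `PageHandler` interface and class name are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PageStreamSketch {
    /** Callback invoked once per page with its title and raw wikitext. */
    interface PageHandler {
        void page(String title, String text);
    }

    /** Streams through the XML, remembering the latest title and reporting each text body. */
    static void forEachPage(Reader in, PageHandler handler) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            String title = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("title")) {
                        title = reader.getElementText();
                    } else if (name.equals("text")) {
                        handler.page(title, reader.getElementText());
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real dump (an assumption about its shape).
        String sample = "<mediawiki><page><title>Alabama</title><revision>"
            + "<timestamp>2008-10-01T00:00:00Z</timestamp>"
            + "<text>|Fish = [[Largemouth bass]]</text>"
            + "</revision></page></mediawiki>";
        forEachPage(new StringReader(sample),
            (title, text) -> System.out.println(title + "\t" + text));
    }
}
```

Because the course file is only the first 100 MB of a larger dump, it may end mid-document, in which case the parser would throw when it reaches the cut-off; pages handled before that point are unaffected.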
For Part 1, please turn in your XSL style sheet.
Part 2 of this assignment is deliberately open-ended. You will be evaluated primarily on creativity, effort, and the quality of your report. We don't expect you to solve relation extraction; we want to see that you can successfully mine XML documents and embedded markup to find something interesting.
Your code should produce two lists, one of fish and one of spouses, in the following format.
For fish: fish<TAB>article name
For spouses: entity1<TAB>entity2<TAB>article name
Please make sure that entity1 sorts before entity2 as a string. In Java, you can check this with entity1.compareTo(entity2) < 0.
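The output rules can be sketched as two small helpers. The class and method names are hypothetical, and the article title used for the spouse example is assumed to match the Infobox name.

```java
final class OutputFormatSketch {
    /** One fish line: fish<TAB>article name. */
    static String fishLine(String fish, String article) {
        return fish + "\t" + article;
    }

    /** One spouse line, with the two names ordered so that entity1.compareTo(entity2) < 0. */
    static String spouseLine(String a, String b, String article) {
        boolean inOrder = a.compareTo(b) < 0;
        String entity1 = inOrder ? a : b;
        String entity2 = inOrder ? b : a;
        return entity1 + "\t" + entity2 + "\t" + article;
    }

    public static void main(String[] args) {
        System.out.println(fishLine("Largemouth bass", "Alabama"));
        // The names are swapped so that "Arnold Schwarzenegger" comes first.
        System.out.println(
            spouseLine("Maria Shriver", "Arnold Schwarzenegger", "Arnold Schwarzenegger"));
    }
}
```

Ordering the pair this way means the same couple found on two different pages produces the same first two columns, which makes duplicates easy to spot.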