Blogs
Aberrant Adjectives in 19th Century Novels
I created the visualization below using Many Eyes and a data set derived from part-of-speech-tagged novels from 19th-century Britain. It shows the 100 most "aberrant adjectives." Here, "aberrant" means the words whose usage, measured by relative frequency, deviates most over a 13-decade period. To qualify, a word must also appear in every decade.
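For readers who want to try a similar selection on their own data, here is a minimal Python sketch of the idea. It is not the code behind the visualization. It assumes that "deviation" means the population standard deviation of a word's relative frequency across decades, and that relative frequency is a word's count divided by the total of all counts supplied for that decade. The function name `aberrant_adjectives`, the input layout, and the toy numbers are all hypothetical.

```python
from collections import Counter
from statistics import pstdev


def aberrant_adjectives(counts_by_decade, top_n=100):
    """Rank adjectives by how much their relative frequency varies across decades.

    counts_by_decade maps a decade label to a Counter of adjective counts.
    Assumption: deviation is the population standard deviation of the
    per-decade relative frequencies.
    """
    decades = sorted(counts_by_decade)
    if not decades:
        return []

    # Total counts per decade, used to turn raw counts into relative frequencies.
    totals = {d: sum(counts_by_decade[d].values()) for d in decades}

    # Keep only words that actually occur in every decade.
    candidates = {w for w, c in counts_by_decade[decades[0]].items() if c > 0}
    for d in decades[1:]:
        candidates &= {w for w, c in counts_by_decade[d].items() if c > 0}

    scores = {}
    for word in candidates:
        rel = [counts_by_decade[d][word] / totals[d] for d in decades]
        scores[word] = pstdev(rel)

    # Highest deviation first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy data: three decades instead of thirteen, invented counts.
toy = {
    "1800s": Counter({"pretty": 10, "good": 50, "odd": 40}),
    "1810s": Counter({"pretty": 30, "good": 50, "odd": 20}),
    "1820s": Counter({"pretty": 50, "good": 50}),
}
print(aberrant_adjectives(toy, top_n=2))  # ['pretty', 'good']
```

In the toy data, "odd" is dropped because it is missing from one decade, and "pretty" outranks "good" because its share of the counts changes from decade to decade while the share for "good" stays flat. A different measure of deviation, such as one normalized by the word's mean frequency, would produce a different ranking.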
Chronicle of Higher Education Article
This week the Chronicle of Higher Education ran an article written by Jennifer Howard about "literary geospaces." The article featured some work I have done mapping Irish-American literature using Google Earth (and also profiled the work of Janelle Jenstad who has been mapping early modern London).
POS Tagging XML with xGrid and the Stanford Log-linear Part-Of-Speech Tagger
Recently (4/2008) I had reason to Part-Of-Speech tag a whole mess of novels, around 1200. I installed the Stanford Tagger and ran my first job of 250 novels on an old G4 under my desk. Everything worked fine, but the job took six days. After that experience, I figured out how to utilize xGrid for "distributed" tagging, or what I'll call, according to convention, "Tagging@Home." At the time that I was working on this tagging project, the folks in Stanford's NLP group, especially Chris Manning and Anna Rafferty, were improving the tagger and adding some XML functionality to the program. I'm very grateful to Chris and Anna for their work. What follows is a practical guide for those who might wish to employ the tagger for use with XML or who might want to understand how to set up xGrid to help distribute a big tagging job.
First I provide some simple examples showing how to take advantage of the XML functionality added in the May 19, 2008 release of the Stanford Log-linear Part-Of-Speech Tagger. Further down on this page I include information about setting up xGrid to farm out a large tagging job to a network of Macs, which is useful if you want to POS tag a large corpus. These examples assume that you have installed the tagger and understand how to invoke it from the command line. If you are not yet familiar with the tagger, you should consult the ReadMe.txt file and the javadoc that come with it. In the javadoc, see specifically the link for "MaxentTagger," where there is a useful "parameter description" table.
Example One: Tagging XML with the Stanford Log-linear Part-Of-Speech Tagger
Many texts are currently available to us in XML format, and in literary circles the most common flavor of XML is TEI. In this example we will POS tag a typical TEI-encoded XML file.