Is it the Joyce Industry or the Shakespeare Industry?
At the recent Digital Humanities Conference in Maryland, Matthew Wilkins and I got into a discussion about famous authors and the "industries" of scholarship that their works have inspired (see Matt's blog post about our discussion and his survey analysis of the MLA bibliography).
The first time I ever heard the term "industry" used in this context was in reference to the scholarship generated by Joyce's novel Ulysses. Joyce himself predicted (bragged) that the book would keep scholars busy for centuries to come, and, of course, Joyce was right. Well, maybe not centuries, but you get the idea. But can we really compare the Shakespeare "industry" to the Joyce "industry" given that the Bard had such a significant head start in establishing his scholarly "fan base"?
Using the MLA bibliography, Matthew W. took a stab at this and compiled some rough figures of recent scholarship on the two masters. By Matt's count, since 1923, Joyce has inspired just 9,315 citations to Shakespeare's massive 35,489.
But there is an obvious problem here: the figures begin in 1923, and Ulysses, the book that really put Joyce on the map, was only published in 1922. So Joyce got into the industry-building business a bit late. Clearly we must do some norming here to account for the Bard's head start.
Now, since I am pretty sure that I owe Matt a beer if the Bard has a bigger industry, I think some well-thought-out math is warranted here:-) . . .
Machine-Classifying Novels and Plays by Genre
In the post that follows, I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to accurately machine-classify novels and plays (Shakespeare's) by genre. One of the most interesting results ends up having more to do with feature extraction than with the classification algorithm.
Background
Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues used a program called Docuscope (developed at Carnegie Mellon) to distinguish statistically between Shakespeare's histories and comedies and to classify them.
"Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as 'Inner Thinking' and 'Past Events.' The program turns differentiating amongst genres into a statistical task by testing the frequency of occurrence of words in each of the categories for each individual genre and recognizing where significant differences occur."
Docuscope was designed as a tool for analyzing student writing, but Witmore (et al.) discovered that it could also be employed as a specialized sort of feature extraction tool.
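To make the idea of dictionary-based feature extraction concrete, here is a minimal sketch in Python. It is not Docuscope and does not use Docuscope's dictionary: the word lists below are a hypothetical toy stand-in, the function name `category_frequencies` is my own, and the matching is simplified to single words rather than strings of words. Only the two category names come from the description quoted above.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: NOT Docuscope's real word lists.
# Only the category names are taken from the quoted description.
TOY_DICTIONARY = {
    "Inner Thinking": {"think", "believe", "wonder", "suppose"},
    "Past Events": {"was", "were", "had", "went"},
}


def category_frequencies(text, dictionary=TOY_DICTIONARY):
    """Return each category's share of all word tokens in the text."""
    # Simplification: single lowercase word tokens, not multi-word strings.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {category: counts[category] / total for category in dictionary}


if __name__ == "__main__":
    sample = "I think he was there and I wonder why"
    for category, share in category_frequencies(sample).items():
        print(f"{category}: {share:.3f}")
```

Run over a whole corpus, a table of such per-category frequencies (one row per text) is the kind of feature set that can then be handed to a statistical test or a classifier.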
To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full-text corpus that included 36 19th-century novels (stripped of title page and other identifying information). We divided this corpus into three groups and organized them by genre:
Executing R in PHP
For their final project, the students in my Introduction to Digital Humanities seminar decided to analyze narrative style in Faulkner's The Sound and the Fury. In addition to significant off-line analysis, we are building a web-based application that allows visitors to compare the different sections of the novel to each other and also to new, unseen texts that visitors to the site can enter themselves.

To achieve this end, the web application must be able to "ingest" a text sample, tokenize it, extract data about the various features that will be used in the comparison, and then prepare (organize, tabulate) those features in a manner that will allow for statistical analysis/comparison. Since my course is not a course in statistics, we decided that I would be responsible for the number crunching.
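The ingest, tokenize, extract, and tabulate steps described above can be sketched roughly as follows. This is an illustration in Python, not the application itself: the function names and the particular features (type-token ratio, mean word length, mean sentence length) are assumptions chosen for the example, since the features the class actually used are not listed here. The output is a CSV table, a format that a statistics package can read directly.

```python
import csv
import io
import re


def tokenize(text):
    """Split a text sample into lowercase word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def extract_features(text):
    """Compute a few illustrative (assumed) style features for one sample."""
    tokens = tokenize(text)
    # Crude sentence split on terminal punctuation; good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_word_length": (
            sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
        ),
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
    }


def tabulate(samples):
    """Turn {label: text} into a CSV string with one row per sample."""
    rows = [
        {"label": label, **extract_features(text)}
        for label, text in samples.items()
    ]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical labels and placeholder text, not passages from the novel.
    print(tabulate({
        "section_a": "The dog ran. The dog sat!",
        "visitor_text": "A new sample typed in by a visitor.",
    }))
```

Keeping the feature table in a plain tabular format means the statistical comparison can live in a separate program from the one that handles the web form.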