
Project Introduction:

This final project consists of an encoded version of James McHenry's The Wilderness, Volume 2. The various features of this site allow you to do the following:

* View a list of the words in the text and their frequencies, sorted either by number of occurrences or alphabetically.
* Search for occurrences of specific words within the context of a chosen number of surrounding words.
* Find all instances of dialogue by a character, searchable by name.

The analysis that this site performs will allow you to determine how often specific words occur, search for keywords in the text, and see the amount of speech attributed to each character, especially in relation to that of others.

This project was prepared for Digital Humanities: Literature, Science, and Technology (ENGLISH 153H/HUMNTIES 198J), taught by Matthew L. Jockers in Autumn of 2004. The original work by James McHenry was published in 1823 in two volumes.

Technology Abstract:

This final project involved a variety of technologies and contributions from every member of the group. Though theoretical readings prepared each member to understand how computers and computation are used in the study of literary texts, it was only through applying that knowledge and completing exercises that we saw how beneficial, and how pleasantly puzzling, it was to employ a range of coding languages for literary analysis. The project is a compilation of the knowledge we acquired through both theory and practice.

A primary technology used in the project was XML. Group members added metadata to a digital transcription of James McHenry's The Wilderness (1823) and inserted paragraph and speaker ("quote") tags within it. The basic XML training we were given in class allowed us to go through the files and tag items as necessary and appropriate. All coding was done in jEdit, a text editor available online for download.

We used PHP to realize several features of our project. As in the last three class exercises, we wrote a script that displays the frequency of every word in the text, sorted either by number of occurrences or alphabetically. We also included a tool to perform word searches within the text. Users enter a particular word and are then given a list of all occurrences of that word, each shown within a specified number of surrounding words. Our "special feature" consists of a PHP script that not only isolates the quotes of each individual character within the text, but also compares the number of unique speaking instances and the number of words that each character speaks.
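The two text tools described above can be illustrated with a short sketch. The project's own scripts were written in PHP and are not reproduced here; the version below uses Python purely for illustration, and the tokenization rule, the function names, and the sample sentence are all assumptions rather than details taken from the project.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase, and treat runs of letters/apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    # Map each word to the number of times it occurs.
    return Counter(tokenize(text))

def by_occurrence(freqs):
    # Most frequent first; ties are broken alphabetically.
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))

def alphabetical(freqs):
    # Alphabetical order by word.
    return sorted(freqs.items())

def keyword_in_context(text, keyword, window):
    # Return each occurrence of `keyword` with up to `window` words on each side.
    words = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits

# Hypothetical sample text, not taken from the novel.
sample = "The river was wide, and the river was cold."

if __name__ == "__main__":
    print(by_occurrence(word_frequencies(sample)))
    print(alphabetical(word_frequencies(sample)))
    print(keyword_in_context(sample, "river", 2))
```

Because the search works on the tokenized text, the context it returns is lowercased and stripped of punctuation; a tool meant for reading would more likely map each hit back to the original text.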

There were numerous kinks with the quote tags that had to be ironed out. For example, PHP was used to "pull out" small narrative interjections such as "she said," so that the character dialogue reported consisted only of the characters' words. Also, at the very last minute we realized we had to better standardize the placement of the quote tags with respect to p tags. We finally decided that quote tags should go around p tags, since there might be multiple paragraphs in one unique instance of a character quote, and since there can never be multiple unique quotes in one paragraph.
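To make the tag placement concrete, here is a minimal sketch of how quote tags wrapped around p tags might be read back out, one speaking instance per quote element. It is written in Python rather than the project's PHP. The attribute name "speaker", the "chapter" wrapper element, and the sample text are hypothetical; the source says only that quote tags identify a character and enclose one or more p tags.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: the element and attribute names other than
# "quote" and "p" are assumptions, and the text is invented.
SAMPLE = """<chapter>
  <p>The evening was still.</p>
  <quote speaker="Charles">
    <p>"We leave at dawn," he said, "and not before."</p>
    <p>"Bring the horses."</p>
  </quote>
  <quote speaker="Nancy Frazier">
    <p>"I shall be ready."</p>
  </quote>
</chapter>"""

def quotes_by_speaker(xml_text):
    # Each <quote> element is one speaking instance; it may hold several <p>.
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for quote in root.iter("quote"):
        paragraphs = [
            " ".join("".join(p.itertext()).split())  # collapse whitespace
            for p in quote.iter("p")
        ]
        result[quote.get("speaker")].append(paragraphs)
    return dict(result)

if __name__ == "__main__":
    for speaker, instances in quotes_by_speaker(SAMPLE).items():
        print(speaker, len(instances), instances)
```

With the quote tag on the outside, a speech that runs over two paragraphs still counts as a single instance, which is the property the placement rule was chosen to preserve.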

Finally, HTML and Cascading Style Sheets (CSS) were used to author and format the webpages on the project site.

Theoretical Issues Abstract:

Tagging character quotes was not as easy as it might at first seem. Initially we decided to tag all the quote segments separately, so that brief narration that interrupted a quote (e.g. "she said, continuing,") was not included in the quote tags. Later, we decided to put the quote tags around the entire quote (even if the quote was not continuous) so that we could count the number of unique "speaking instances" in addition to the number of words spoken. This worked because we were able to use PHP to remove any text not enclosed in quotation marks, effectively removing any narration.
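The narration-stripping step can be sketched roughly as follows. This is an illustration in Python, not the project's PHP script. It assumes the transcription uses straight double quotation marks that always come in matched pairs; the function names and sample strings are hypothetical.

```python
import re

def spoken_text(quote_text):
    # Keep only the segments between pairs of straight double quotation
    # marks; anything outside them (e.g. "she said") is dropped.
    return " ".join(re.findall(r'"([^"]*)"', quote_text))

def speech_stats(quotes):
    # `quotes` holds the raw text of each tagged quote for one character.
    # Returns (number of speaking instances, number of words spoken).
    instances = len(quotes)
    words = sum(len(spoken_text(q).split()) for q in quotes)
    return instances, words

# Hypothetical tagged quotes for a single character.
charles = [
    '"We leave at dawn," he said, continuing, "and not before."',
    '"Bring the horses."',
]

if __name__ == "__main__":
    print(spoken_text(charles[0]))  # We leave at dawn, and not before.
    print(speech_stats(charles))    # (2, 10)
```

A pairing rule like this would miscount a multi-paragraph speech in which each paragraph opens with a quotation mark but only the last one closes, so a real text may need extra handling for that convention.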

Another interesting challenge was deciding on the list of character names for the quote tag attributes. Throughout the book, several of the characters are referred to by slightly differing names (e.g. Charles vs. Charles Adderly). Even more problematic were cases in which characters changed names (such as Nancy Frazier marrying and becoming Nancy Killbreath). In general, we picked one name for each character and made sure each tag used that name, so that all of the character's speech could be compiled into one set of data. Even when a character was writing a letter or thinking aloud to the reader (for example, Charles Adderly), we considered such unspoken dialogue equivalent to any "normal" dialogue. As a result, no distinctions were drawn.

The one exception to this name standardization is the case of Nancy Frazier becoming Nancy Killbreath. In this case, we thought it might be more interesting to compare the dialogue from both the unmarried and married incarnations of the character in order to see if there were any significant differences. Indeed, it is interesting to note that Nancy Frazier speaks fewer times than Nancy Killbreath, but speaks more words, indicating that the unmarried version has significantly longer quotes on average.
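One way to picture the name standardization, including the Nancy exception, is as a lookup table applied before the quotes are grouped. The sketch below is in Python and is only an assumption about how such a table could look. The source does not say which variant of each name was chosen, so the canonical spellings here are hypothetical.

```python
# Hypothetical alias table: variant name -> the single name used for grouping.
# Nancy Frazier and Nancy Killbreath deliberately stay separate.
CANONICAL = {
    "Charles": "Charles Adderly",
    "Charles Adderly": "Charles Adderly",
    "Nancy Frazier": "Nancy Frazier",
    "Nancy Killbreath": "Nancy Killbreath",
}

def canonical_name(raw):
    # Normalize whitespace, then fall back to the raw name if it is unknown.
    key = " ".join(raw.split())
    return CANONICAL.get(key, key)

def merge_by_character(tagged_quotes):
    # `tagged_quotes` is a list of (name as tagged, quote text) pairs.
    merged = {}
    for raw_name, text in tagged_quotes:
        merged.setdefault(canonical_name(raw_name), []).append(text)
    return merged

if __name__ == "__main__":
    sample = [
        ("Charles", "first quote"),
        ("Charles Adderly", "second quote"),
        ("Nancy Frazier", "third quote"),
        ("Nancy Killbreath", "fourth quote"),
    ]
    print(merge_by_character(sample))
```

In the project itself the standardization was done by hand in the tags, so a table like this would matter mainly for checking that no stray variant slipped through.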

In general, our online tools perform a mostly numerical analysis of the quote metadata. The website allows users to view how many times each character spoke and in how many words, and allows the user to compare the data between different characters. In the future, we envision this metadata allowing researchers to investigate issues such as gender (Do women speak more than men?) or the relative amounts of dialogue throughout the novel (Are there any especially dialogue-heavy chapters? Are there hidden patterns concerning when certain characters tend to speak and not speak?).