Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2009
Christopher Potts

Assignment 7 - More on building custom classes

Distributed 2009-11-09
Due 2009-11-15

======================================================================

NOTE: Please choose ONE of the following options.

======================================================================

1. Switchboard timed transcript

The NLTK data distribution contains a fragment of the Switchboard
corpus (in corpora/switchboard). Open up the timed_transcript file and
check it out. Basically, it consists of a number of transcriptions of
phone conversations. The individual conversations are separated by
blank lines. And, within conversations, each turn takes this form:

  START_TIME END_TIME UTTERANCE

This question involves writing two classes:

* Conversation
* Utterance

A Conversation object should be made up of Utterance objects.

Required methods/attributes:

* Conversation
  -- duration
  -- turn_count
  -- word_count

* Utterance
  -- start_time
  -- end_time
  -- duration
  -- words
  -- word_count

What counts as a word? You decide!

You should provide functionality for reading in the timed_transcript
file and parsing it into a list of Conversation objects (which, again,
should contain Utterance objects).

To illustrate how your classes work, do at least the following:

* Show how to derive a cumulative word count for the whole of
  timed_transcript.

* Show how to derive a cumulative duration count for the whole file.

* Show how to derive average and median utterance lengths.
  (Suggestion: load numpy at least for the median calculation.)

* Answer the question: Are there *any* overlapping utterances,
  according to the times given? If so, where?

======================================================================

2. Computing fractions.

The goal is to write a command-line program that allows the user to do
basic math with fractions, by writing things like

  python fractions.py "4/3 + 3/4"
  > 25/12

Why might you want to do this?
Well, math is more precise this way. Floating-point numbers are often
approximations, and the errors those approximations introduce can
snowball into substantial mistakes. Also, it can be useful to have
the capability to do this; you can alias it using your .bash_profile
and then you have easy access to a fractions calculator.

Requirements:

* A Fraction class that initializes with a string and builds an object
  that has at least the following methods or attributes:

    numerator             ==> int
    denominator           ==> int
    plus(fraction)        ==> fraction
    minus(fraction)       ==> fraction
    times(fraction)       ==> fraction
    divided_by(fraction)  ==> fraction
    __str__

  The things to the right of the arrows give return values. So the
  numerator and denominator should be integers, and the basic
  mathematical operations should take fractions as their arguments
  and return new fractions. Finally, you need to define str() for
  fraction objects.

* A command-line interface that takes a string as argument, where
  the string is of the form

    fraction1 OP fraction2

  where OP is +, -, /, or *. You'll need to parse this in a way that
  gives the right sort of inputs for your Fraction class.

Optional extensions:

* Write more functions: exponentiation, etc.

* Simplify your return values, so that, e.g.,

    4/1   ==> 4
    12/24 ==> 1/2

* More challenging: Write a Parser class that genuinely parses complex
  expressions like:

    ((1/2 + 1/4) * 4/5)

======================================================================

3. Sentiment reviews

I've stashed a small fragment of the full text of the UMass Amherst
Sentiment Corpora here:

  https://www.stanford.edu/class/linguist278/restricted/data/umass-amherst-sentiment-fragment.zip

This is a tagged version. The tagging was done with the Stanford
log-linear classifier, which was trained on very different text, so
the accuracy is not great. Still, it's useful for some purposes.
Here's the format of the files, by way of an example:

  42 of 45 people found the following review helpful
  5.0 out of 5 stars
  A great companion to The Biggest Loser book and show- written by one of the fitness coaches for the show
  March 7, 2006
  K. Corn "reviewer"
  Indianapolis,, IN United States
  Watching/VBG the/DT show/NN and/CC reading/NN this/DT book/NN helped/VBD me/PRP to/TO lose/VB a/DT very/RB stubborn/JJ 27/CD pounds./NN [...]

Each field is on a single line (even the reviews, which can be quite
long).

* Write a Review class that initializes on a string like the above and
  allows easy access to the parts of the individual texts, with at
  least the following attributes/methods:

    rating                  ==> int
    summary                 ==> string
    date                    ==> string
    author                  ==> string
    location                ==> string
    review                  ==> string
    found_helpful           ==> (int; 42 in the above example)
    registered_readers      ==> (int; 45 in the above example)
    helpfulness_percentage  ==> (float; 42.0/45.0, roughly 0.933, in
                                the above example)

* In addition, we want the following (perhaps via the above review
  method):

  -- A version of the review text with the tags stripped out.
  -- A version of the review text that is a list of word--tag tuples.

* A Corpus class that initializes on a file glob (set of file names)
  and creates an object that contains Review objects. Your corpus
  should contain an iterator that allows you to write:

    for review in corpus:
        ...

  where the for loop yields the Review objects one by one.

Illustrate how your classes work by providing code for the following:

* A cumulative word count of all the words in the review text in the
  corpus.

* A cumulative word count of all the words in the summary text in the
  corpus.

* A dictionary mapping tags to the number of tokens with that tag in
  the text.

* A dictionary mapping ratings to the number of reviews with that
  rating.

======================================================================
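
For option 3, the word--tag handling can be sketched roughly as
follows. This is only one possible starting point, not a required
design. The function names (parse_tagged, strip_tags, tag_counts) are
made up for illustration, and the sketch assumes that tokens are
separated by whitespace and that the tag is whatever follows the last
slash in a token.

```python
def parse_tagged(text):
    """Split tagged review text into a list of (word, tag) tuples."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, so a word that itself contains a
        # slash keeps it.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Assumption: a token with no slash is an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def strip_tags(text):
    """Return the review text with the tags removed."""
    return " ".join(word for word, tag in parse_tagged(text))


def tag_counts(text):
    """Map each tag to the number of tokens carrying it."""
    counts = {}
    for word, tag in parse_tagged(text):
        counts[tag] = counts.get(tag, 0) + 1
    return counts


sample = "Watching/VBG the/DT show/NN and/CC reading/NN this/DT book/NN"
print(parse_tagged(sample)[:2])
print(strip_tags(sample))
print(tag_counts(sample))
```

A Review class could call helpers like these from its review method,
and a Corpus could sum the per-review tag dictionaries to get the
corpus-wide counts.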