Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2009
Christopher Potts
Assignment 7 - More on building custom classes
Distributed 2009-11-09
Due 2009-11-15
======================================================================
NOTE: Please choose ONE of the following options.
======================================================================
1. Switchboard timed transcript
The NLTK data distribution contains a fragment of the Switchboard
corpus (in corpora/switchboard). Open up the timed_transcript file
and check it out. Basically, it consists of a number of transcriptions
of phone conversations. The individual conversations are separated by
blank lines. And, within conversations, each turn takes this form:
START_TIME END_TIME UTTERANCE
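Here is a minimal sketch (Python 3) of pulling one turn apart. It assumes
the two times are plain decimal numbers separated from the utterance by
whitespace; parse_turn and the example line are made up for illustration,
so check the real file before relying on either.

```python
def parse_turn(line):
    """Split one 'START_TIME END_TIME UTTERANCE' line into its three parts."""
    # maxsplit=2 leaves the utterance text intact, internal spaces included.
    # Assumption: every turn has some utterance text; a line with only two
    # fields would raise ValueError here.
    start, end, text = line.strip().split(None, 2)
    return float(start), float(end), text


# Hypothetical line, invented for illustration (not copied from the corpus).
example = "0.50 2.25 okay so what do you think"
start, end, text = parse_turn(example)
print(start, end, text)
```

An Utterance object could store these three values and compute its
duration as end minus start.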
This question involves writing two classes:
* Conversation
* Utterance
A Conversation object should be made up of Utterance objects.
Required methods/attributes:
* Conversation
-- duration
-- turn_count
-- word_count
* Utterance
-- start_time
-- end_time
-- duration
-- words
-- word_count
What counts as a word? You decide!
You should provide functionality for reading in the timed_transcript
file and parsing it into a list of Conversation objects (which, again,
should contain Utterance objects). To illustrate how your classes
work, do at least the following:
* Show how to derive a cumulative word count for the whole of
timed_transcript.
* Show how to derive a cumulative duration count for the whole file.
* Show how to derive average and median utterance lengths.
(Suggestion: load numpy at least for the median calculation.)
* Answer the question: Are there *any* overlapping utterances,
according to the times given? If so, where?
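One possible way to approach the summary statistics and the overlap
question, sketched in Python 3 on made-up (start, end) pairs. It uses the
standard library's statistics module in place of the numpy suggested
above, measures utterance length as duration (you might prefer words),
and assumes turns are listed in order of start time, so it only compares
each turn with the one before it. find_overlaps is a hypothetical helper.

```python
from statistics import mean, median


def find_overlaps(turns):
    """Return (index, previous_turn, turn) for turns that start early."""
    overlaps = []
    for i in range(1, len(turns)):
        prev_start, prev_end = turns[i - 1]
        start, end = turns[i]
        # A turn overlaps if it begins before the previous one has ended.
        if start < prev_end:
            overlaps.append((i, turns[i - 1], turns[i]))
    return overlaps


# Made-up times for illustration; real values come from timed_transcript.
turns = [(0.0, 1.5), (1.5, 3.0), (2.5, 4.0), (4.0, 4.5)]
durations = [end - start for start, end in turns]

print(mean(durations), median(durations))
print(find_overlaps(turns))
```

With real data you would build the list of turns from each Conversation's
Utterance objects rather than typing it in.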
======================================================================
2. Computing fractions
The goal is to write a command-line program that allows the user to do
basic math with fractions, by writing things like
python fractions.py "4/3 + 3/4"
> 25/12
Why might you want to do this? Well, math is more precise this way.
Floating-point values are often approximations, and the errors those
approximations introduce can snowball into substantial mistakes.
It is also simply handy to have: alias the script in your .bash_profile
and you have easy access to a fractions calculator from the command
line.
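To see the precision point concretely, here is a short Python 3
illustration. It leans on the standard library's fractions module purely
for comparison; the assignment still asks you to write your own Fraction
class. Note also that a script saved as fractions.py shadows that
standard module for anything run from the same directory, so save this
snippet under a different name.

```python
from fractions import Fraction

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the sum is slightly off and the comparison fails.
print(0.1 + 0.2 == 0.3)  # False

# Exact rational arithmetic has no such rounding error.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True

# The example from the handout: 4/3 + 3/4.
print(Fraction(4, 3) + Fraction(3, 4))  # 25/12
```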
Requirements:
* A Fraction class that initializes with a string and builds an object
that has at least the following methods or attributes:
numerator ==> int
denominator ==> int
plus(fraction) ==> fraction
minus(fraction) ==> fraction
times(fraction) ==> fraction
divided_by(fraction) ==> fraction
__str__
The things to the right of each arrow give return values. So the numerator
and denominator should be integers, and the basic mathematical
expressions should take fractions as their arguments to return new
fractions. Finally, you need to define str() for fraction objects.
* A command-line interface that takes a string as argument, where
the string is of the form
fraction1 OP fraction2
where OP is +, -, /, or *. You'll need to parse this in a way that
gives the right sort of inputs for your Fraction class.
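A rough sketch (Python 3) of the command-line parsing step. It assumes
the operator is set off by whitespace and the fractions contain none, as
in "4/3 + 3/4"; that is what lets a plain split() tell the division
operator apart from the slash inside a fraction. The names parse_fraction,
parse_expression, and main are hypothetical, and the pairs returned here
stand in for whatever your Fraction class expects.

```python
def parse_fraction(text):
    """Turn 'a/b' (or a bare integer 'a') into a (numerator, denominator) pair."""
    numerator, slash, denominator = text.partition("/")
    if slash:
        return int(numerator), int(denominator)
    return int(numerator), 1


def parse_expression(expression):
    """Split 'fraction1 OP fraction2' on whitespace into its three parts."""
    parts = expression.split()
    if len(parts) != 3:
        raise ValueError("expected 'fraction1 OP fraction2', got %r" % expression)
    left, op, right = parts
    if op not in ("+", "-", "*", "/"):
        raise ValueError("unknown operator: %r" % op)
    return parse_fraction(left), op, parse_fraction(right)


def main(argv):
    # In a real script you would call main(sys.argv).
    print(parse_expression(argv[1]))


main(["fractions_cli.py", "4/3 + 3/4"])
```

Keep the quotes around the expression on the command line; without them
the shell splits it into several arguments and may expand * itself.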
Optional extensions:
* Write more functions: exponentiation, etc.
* Simplify your return values, so that, e.g.,
4/1 ==> 4
12/24 ==> 1/2
* More challenging: Write a Parser class that genuinely parses
complex expressions like:
((1/2 + 1/4) * 4/5)
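For the simplification and parsing extensions, here is one possible
sketch in Python 3, not a reference solution. simplify and show reduce a
fraction with the greatest common divisor. The Parser is a small
recursive-descent evaluator in which the standard library's Fraction
stands in for your own class. It assumes a fraction is written without
internal spaces (so "3/4" is one literal, while " / " between operands is
division) and it does not handle negative literals. All names here are
hypothetical.

```python
import re
from fractions import Fraction
from math import gcd


def simplify(numerator, denominator):
    """Reduce to lowest terms, keeping the denominator positive."""
    if denominator == 0:
        raise ZeroDivisionError("denominator is zero")
    divisor = gcd(numerator, denominator)
    if denominator < 0:
        divisor = -divisor
    return numerator // divisor, denominator // divisor


def show(numerator, denominator):
    """Format a simplified fraction, printing n/1 as just n."""
    numerator, denominator = simplify(numerator, denominator)
    return str(numerator) if denominator == 1 else "%d/%d" % (numerator, denominator)


# A token is a fraction literal like 3/4, a bare integer, or an operator.
TOKEN = re.compile(r"\s*(\d+(?:/\d+)?|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Grammar: expr -> term (+|- term)*; term -> factor (*|/ factor)*."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        value = self.expr()
        if self.peek() is not None:
            raise ValueError("unexpected token: %r" % self.peek())
        return value

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token is None or not token[0].isdigit():
            raise ValueError("expected a fraction or '('")
        return Fraction(token)  # Fraction accepts strings like '3/4' and '4'


result = Parser("((1/2 + 1/4) * 4/5)").parse()
print(show(result.numerator, result.denominator))
```

Treating "3/4" as a single literal keeps the parser consistent with the
simple "fraction1 OP fraction2" interface: "1/2 / 3/4" means one half
divided by three quarters.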
======================================================================
3. Sentiment reviews
I've stashed a small fragment of the full text of the UMass Amherst
Sentiment Corpora here:
https://www.stanford.edu/class/linguist278/restricted/data/umass-amherst-sentiment-fragment.zip
This is a tagged version. The tagging was done with the Stanford
log-linear classifier, which was trained on very different text, so
the accuracy is not great. Still, it's useful for some purposes.
Here's the format of the files, by way of an example:
42 of 45 people found the following review helpful
5.0 out of 5 stars
A great companion to The Biggest Loser book and show- written by one of the fitness coaches for the show
March 7, 2006
K. Corn "reviewer"
Indianapolis,, IN United States
Watching/VBG the/DT show/NN and/CC reading/NN this/DT book/NN helped/VBD me/PRP to/TO lose/VB a/DT very/RB stubborn/JJ 27/CD pounds./NN [...]
Each field is on a single line (even the reviews, which can be quite
long).
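Since each field sits on its own line, a first pass at parsing can simply
split on newlines. The sketch below (Python 3) assumes every review has
exactly these seven lines in this order, which you should verify against
the data. parse_review is a hypothetical helper that returns a plain
dictionary rather than a full class, and the sample's review line is cut
down to the first three tokens of the example above.

```python
import re

SAMPLE = "\n".join([
    "42 of 45 people found the following review helpful",
    "5.0 out of 5 stars",
    "A great companion to The Biggest Loser book and show- written by "
    "one of the fitness coaches for the show",
    "March 7, 2006",
    'K. Corn "reviewer"',
    "Indianapolis,, IN United States",
    "Watching/VBG the/DT show/NN",  # shortened excerpt of the review line
])


def parse_review(text):
    """Split one review into named fields (assumes exactly seven lines)."""
    helpful, rating, summary, date, author, location, review = text.strip().splitlines()
    # First two integers on the helpfulness line: '42 of 45 people ...'.
    found, total = [int(n) for n in re.findall(r"\d+", helpful)[:2]]
    return {
        "found_helpful": found,
        "registered_readers": total,
        # True division in Python 3; raises ZeroDivisionError if total is 0.
        "helpfulness_percentage": found / total,
        # '5.0 out of 5 stars' -> 5
        "rating": int(float(rating.split()[0])),
        "summary": summary,
        "date": date,
        "author": author,
        "location": location,
        "review": review,
    }


fields = parse_review(SAMPLE)
print(fields["found_helpful"], fields["registered_readers"], fields["rating"])
```

A Review class could do the same work in its __init__ and store each
value as an attribute.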
* Write a Review class that initializes on a string like the above and
allows easy access to the parts of the individual texts, with at
least the following attributes/methods:
rating ==> int
summary ==> string
date ==> string
author ==> string
location ==> string
review ==> string
found_helpful ==> (int; 42 in the above example)
registered_readers ==> (int; 45 in the above example)
helpfulness_percentage ==> (float; 42.0/45.0, about 0.9333, in the above example)
* In addition, we want the following (perhaps via the above review
method):
-- A version of the review text with the tags stripped out.
-- A version of the review text that is a list of word--tag tuples.
* A Corpus class that initializes on a file glob (set of file names)
and creates an object that contains Review objects. Your corpus should
contain an iterator that allows you to write:
for review in corpus:
...
where the for loop yields the Review objects one by one.
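One way to get that for-loop behavior is to make __iter__ a generator
method. The sketch below (Python 3) assumes one review per file, which
the handout does not state, and uses a stand-in Review class that only
stores the raw text. The two throwaway files are created just so the
example runs on its own.

```python
import glob
import os
import tempfile


class Review:
    """Stand-in for the real Review class: it only keeps the raw text."""

    def __init__(self, text):
        self.text = text


class Corpus:
    def __init__(self, pattern):
        # Sort so that the iteration order is predictable.
        self.filenames = sorted(glob.glob(pattern))

    def __iter__(self):
        # Generator method: files are read lazily, one per loop step,
        # and each new for loop starts again from the first file.
        for filename in self.filenames:
            with open(filename) as handle:
                yield Review(handle.read())


# Demo on two temporary files standing in for the real corpus files.
with tempfile.TemporaryDirectory() as directory:
    for name, text in [("a.txt", "first review"), ("b.txt", "second review")]:
        with open(os.path.join(directory, name), "w") as handle:
            handle.write(text)
    corpus = Corpus(os.path.join(directory, "*.txt"))
    texts = [review.text for review in corpus]

print(texts)
```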
Illustrate how your classes work by providing code for the following:
* A cumulative word count of all the words in the review text in the
corpus.
* A cumulative word count of all the words in the summary text in the
corpus.
* A dictionary mapping tags to the number of tokens with that tag in
the text.
* A dictionary mapping ratings to the number of reviews with that
rating.
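The tag handling and the counting dictionaries can be sketched as follows
(Python 3). It assumes tokens are separated by whitespace and that the
tag follows the last slash in each token; word_tag_pairs is a
hypothetical helper, the sample text is the start of the review shown
earlier, and the ratings list is invented for illustration.

```python
from collections import Counter


def word_tag_pairs(tagged_text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for token in tagged_text.split():
        # Split on the LAST slash so words that contain slashes survive.
        word, slash, tag = token.rpartition("/")
        if not slash:
            # No slash at all: keep the token as a word with no tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


tagged = "Watching/VBG the/DT show/NN and/CC reading/NN this/DT book/NN"
pairs = word_tag_pairs(tagged)

# Review text with the tags stripped out.
untagged = " ".join(word for word, tag in pairs)

# Tag -> number of tokens carrying that tag.
tag_counts = Counter(tag for word, tag in pairs)

# Rating -> number of reviews; these ratings are made up for the demo.
rating_counts = Counter([5, 5, 4])

print(untagged)
print(len(pairs), dict(tag_counts), dict(rating_counts))
```

Counter is a dictionary subclass, so it satisfies the request for a
dictionary while saving you the bookkeeping.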
======================================================================