Linguistics 203: Assignment
Week 4: October 16, 2002
Due: October 23, 2002

Using tgrep to explore dative alternation

So far, all corpus studies discussed in this course have been string searches. But this is very limiting: many questions in syntax are not tied to particular lexical items. To facilitate linguistically more sophisticated searches, various corpora have been annotated with several types of information. Among the best known annotated corpora are those in the Treebank of the Linguistic Data Consortium at the University of Pennsylvania. Three LDC corpora – Brown, Wall Street Journal, and Switchboard – have been tagged for part of speech and parsed into phrase structure trees. The tool provided for searching the parsed versions of these corpora is called tgrep. This exercise is intended to give you a feel for working with the LDC Treebank and tgrep.

The question you are going to investigate was inspired by Arnold Zwicky’s NWAV talk. He claimed that many complex constructions have very skewed distributions in usage. In particular, most actual occurrences of certain phenomena are associated with one of a few “seed” words, even though they are possible with many other words. His first example was verb phrase ellipsis after the infinitival to, in which well over half the exemplars occur with forms of the verbs want, have, and used.

Your task is to determine whether the English dative alternation (DA) exhibits this kind of skewing. DA is the alternation between V-NP1-NP2 and V-NP2-to-NP1, as in Chris handed the baby a toy ~ Chris handed a toy to the baby. (There is a similar alternation involving the preposition for, but you should limit yourselves to cases with to.)

Getting Started with tgrep: There are some preliminaries needed for getting started with tgrep. A useful web page for this is http://help-csli.stanford.edu/corpora/tgrep.html.

There are a number of web pages with helpful instructions on using tgrep. A good one is http://mccawley.cogsci.uiuc.edu/corpora/tgrepdocs.html. Another useful resource about corpora in general (including tgrep) is the web page from a course Emily Bender taught at Berkeley a couple of years ago. That page is now located at http://www-csli.stanford.edu/~bender/corpora/index.html.

A Few Tips: Easily tgrepable versions of the three corpora in the LDC Treebank are in /afs/ir/data/linguistic-data/Treebank/tgrepabl.

One problem you will face is learning the Treebank tag set (that is, the abbreviations they use for various types of words and phrases). I don’t remember it from one search to the next and find that the easiest way to deal with this problem is just to look at a few trees. Some of the documentation cited above has example trees in it, but it might be easier just to do something like tgrep 'NP < VP', as suggested in the document cited above, and look at the output. A crucial tag for the exercise in question is PP-DTV, which is the tag used for prepositional phrases headed by to that can undergo the dative alternation. This makes it possible to extract both the double object (V-NP1-NP2) and prepositional (V-NP2-to-NP1) DA constructions automatically. Of course, you will want to inspect a substantial number of the trees you extract to make sure you’re getting what you want. But you don’t have to look at every one.

There are no doubt a number of ways of doing this exercise. I would do my tgrep searches and direct the output of each search to a file. I would then use grep to count the things I want to count in the files. Grep will count for you if you put -c (a plain hyphen followed by c) between “grep” and the pattern it is searching for; note that it reports the number of matching lines, not the number of individual matches. There are no doubt better ways of doing this, and some of you who are computationally more sophisticated may wish to use them. But my method is fairly simple, even for the technologically challenged like me.
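
To make the counting step concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the three sample trees are made up, they are written one tree per line (real tgrep output may be laid out differently, which matters because grep counts lines), and the tgrep command in the comment is an untested example pattern that leaves out how the corpus is selected.

```shell
#!/bin/sh
# In practice the file would come from a tgrep search redirected to a file,
# for example (illustrative pattern and file name, not a tested query):
#   tgrep 'VP < PP-DTV' > pp_dative.txt
# Here a small made-up sample stands in for that saved output.
out=$(mktemp)
cat > "$out" <<'EOF'
(VP (VBD gave) (NP (DT a) (NN toy)) (PP-DTV (TO to) (NP (DT the) (NN baby))))
(VP (VBD handed) (NP (DT a) (NN book)) (PP-DTV (TO to) (NP (PRP me))))
(VP (VBZ gives) (NP (NN money)) (PP-DTV (TO to) (NP (NNS charities))))
EOF

# Count the lines that contain the PP-DTV tag (all prepositional datives here).
total=$(grep -c 'PP-DTV' "$out")

# Count the lines whose verb is some form of "give".
# -E turns on extended regular expressions; \( and \) match literal parentheses.
give=$(grep -c -E '\(VB[A-Z]? (give|gives|gave|given|giving)\)' "$out")

echo "total: $total"
echo "give:  $give"

rm -f "$out"
```

On the sample above this reports 3 prepositional datives, 2 of them with a form of give. Comparing counts like these across verbs is one simple way to see whether a few verbs account for most of the examples.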



Last modified: January 20, 2003