Tgrep & TGrep2

Tgrep and its improved successor TGrep2 are Unix-based tools that allow you to search syntactically parsed and POS-annotated corpora on AFS. The syntax takes some getting used to, but it is worth it, since the searches you can do with these tools are quite powerful. If you prefer a graphical interface, you can use TIGERsearch, which has the same search options. Below you will find information on setting up TGrep2, setting up TGrep, the TGrep syntax with some examples, and the differences between TGrep2 and Tgrep.

Tip-1: Much of the information on this page is summarized in this handout on TGrep 1 by Tatiana Nikitina and Jeanette Pettibone (PDF file; ~130KB), which provides a short intro and a concise summary of the TGrep syntax.


Setting up TGrep2

By Susanne Riehemann (ed. by Florian Jaeger and Liz Coppock, with input from Neal Snider): To use TGrep2 you need to be logged in to a firebird or raptor computer. TGrep2 is compiled only for Linux and doesn't work on the elaines! Use Samson or any ssh software to connect to one of the firebirds, i.e. firebird1.stanford.edu through firebird15.stanford.edu.

If you want to be able to use the command "tgrep2" without typing its full path name you need to add

    /afs/ir/data/linguistic-data/bin/linux_2_4

to your PATH variable. You can do this for the current session only by entering:

    > setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
(Note: the greater-than sign ">" at the beginning of the line represents the command-line prompt here and in what follows.) To make it so that your PATH variable is set this way every time you log in, you need to modify your ~/.login file. To do this, type:
    > cd ~
    > emacs .login
(Mac users: if this doesn't work, type:
    > setenv TERM xterm
and try again.)

Once you are in emacs, add the following lines (the first line is a comment, to keep things organized):

    # Tgrep2 stuff
    setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
To save, type:
    Ctrl-x Ctrl-s
To exit emacs, type:
    Ctrl-x Ctrl-c
Now, next time you log in, you won't have to worry about setting your PATH variable. To put the changes to your .login file into effect in the current session, type this at the command line:
    > source ~/.login
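
The commands above use csh/tcsh syntax (setenv, ~/.login). If your login shell is an sh-style shell such as bash (an assumption; the setup described here covers only csh/tcsh), a rough equivalent is sketched below. The helper name add_to_path is made up for illustration, and the lines would typically go in a startup file such as ~/.bash_profile or ~/.profile.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;       # otherwise put it in front
    esac
    export PATH
}

# Directory taken from the instructions above.
add_to_path /afs/ir/data/linguistic-data/bin/linux_2_4
```

Checking before prepending keeps PATH from growing each time the startup file is sourced again.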

The files for the tgrep2able versions of the Brown, Switchboard, WSJ, NEGRA, and Chinese Treebank corpora are in

    /afs/ir/data/linguistic-data/Treebank/tgrep2able

You can find out what corpora can be used with Tgrep2 by typing:

    > ls /afs/ir/data/linguistic-data/Treebank/tgrep2able
Say you want to use the Wall Street Journal corpus. Then you can specify that corpus as an argument to Tgrep2 by typing:

    > cd /afs/ir/data/linguistic-data/Treebank/tgrep2able/
    > tgrep2 -c wsj_mrg.t2c.gz 'VP < NP'
If you are doing multiple searches with the same corpus, you don't want to specify the corpus you are using each time you run Tgrep2. You can specify a default corpus to use by changing your TGREP2_CORPUS variable. You can do this for the current session only by typing at the command line:
    > setenv TGREP2_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
If you want to have this variable set automatically every time you log in, then you need to modify your .login file again:
    > cd ~
    > emacs .login
Once you are in emacs, add the following line (this assumes you want the Wall Street Journal as your default corpus):
    setenv TGREP2_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
If you want to have Switchboard as your default corpus, then use this instead:
    setenv TGREP2_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrep2able/swbd.t2c.gz
To save, type:
    Ctrl-x Ctrl-s
To exit emacs, type:
    Ctrl-x Ctrl-c
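
If you switch between corpora often, a small shell function can save typing. The sketch below is for an sh-style shell such as bash (an assumption; the instructions here are written for csh/tcsh). The function name use_corpus and the short names wsj and swbd are hypothetical; the directory and file names are the ones given above.

```shell
# Directory holding the tgrep2able corpora (from the text above).
T2C_DIR=/afs/ir/data/linguistic-data/Treebank/tgrep2able

# Hypothetical helper: point TGREP2_CORPUS at one of the two corpora named above.
use_corpus() {
    case "$1" in
        wsj)  TGREP2_CORPUS="$T2C_DIR/wsj_mrg.t2c.gz" ;;
        swbd) TGREP2_CORPUS="$T2C_DIR/swbd.t2c.gz" ;;
        *)    echo "unknown corpus: $1" >&2; return 1 ;;
    esac
    export TGREP2_CORPUS
}

# Example: make Switchboard the default for this session.
use_corpus swbd
echo "$TGREP2_CORPUS"
```

An unknown short name leaves TGREP2_CORPUS unchanged and returns a non-zero status.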

Once you are out of emacs, try this to test out Tgrep2:

    > tgrep2 -afl 'NP-SBJ < CC' | less
This will give you a list of the subject NPs that immediately dominate a coordinating conjunction (e.g. "and", "or"). The "-afl" part specifies a useful set of options (see below). The vertical bar ("pipe") sends the output of the search to the program called "less", which allows you to scroll through the output. To quit less, type:
    q

The first two outputs of that search are the phrases "all the brothers and sisters", and "freedom of speech in this country and everything".


(NP-SBJ (PDT all)
        (DT the)
        (NNS brothers)
        (CC and)
        (NNS sisters))

(NP-SBJ (NP (NP (NN freedom))
            (PP (IN of)
                (NP (NN speech)))
            (PP-LOC (IN in)
                    (NP (DT this)
                        (NN country))))
        (CC and)
        (NP (NN everything)))

As you can see, the output is hierarchically structured (think of it as a syntax tree lying on its side: say a very sad syntax tree).

If you want to save the output of your search into a text file, then redirect the output to a file in your home directory with the ">"-operator thus:

    > tgrep2 -afl 'NP-SBJ < CC' > ~/coord-subj-NPs.txt
To take a look at this file, type:
    > less ~/coord-subj-NPs.txt
Now, since you probably don't actually care about coordinated subject NPs, you should delete this file!
    > rm ~/coord-subj-NPs.txt

See also the TGrep syntax and some examples for more details, but here are some quick tips on Tgrep and Tgrep2 syntax:

Patterns. Suppose you want to find all of the subject noun phrases in your corpus. Not all subject NPs have exactly the same label, but all subject NPs have the tag -SBJ marking them. We can say we want all constituents whose label contains -SBJ by putting -SBJ in forward slashes, thus:

    > tgrep2 -afl '/-SBJ/' | less
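
To get a feel for what "contains -SBJ" means, you can imitate the match with grep on a few node labels. This is only an analogy: it assumes that a pattern in slashes matches anywhere in the label unless it is anchored with ^ or $, and it does not call tgrep2 itself. The sample labels are chosen for illustration, not extracted from a corpus, and the commands are written for an sh-style shell.

```shell
# Sample node labels, one per line (illustrative only).
labels='NP
NP-SBJ
NP-SBJ-1
NP-PRD
S'

# Unanchored, like /-SBJ/: any label containing "-SBJ".
printf '%s\n' "$labels" | grep -e '-SBJ'

# Anchored at the end, like /-SBJ$/: only labels that end in "-SBJ".
printf '%s\n' "$labels" | grep -e '-SBJ$'
```

The first command prints NP-SBJ and NP-SBJ-1; the second prints only NP-SBJ.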

Other operators you can use:

<  immediately dominates
<<  dominates
$  is a sister of
!  negation
Example: If you want to find all PPs containing the word "aside", you could use this pattern:
    PP << aside

Order of operations (important!). Multiple operators are assumed to be related by an AND relation. So the following means, "an S that immediately dominates an NP and immediately dominates a VP":

    S < NP < VP
To nest operators, you must group them with parentheses:
    S < (NP < DT)
This means, "an S that immediately dominates an NP that in turn immediately dominates a DT". Without the parentheses, both relations would apply to the S. This is an important fact to remember! This will be the source of many of your problems!
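
One way to convince yourself that grouping matters is to count the matches for a flat and a nested version of the same pattern. The sketch below is for an sh-style shell and rests on two assumptions: that TGREP2_CORPUS is already set, and that without the -l option tgrep2 prints each match on one line. compare_patterns is a hypothetical helper name.

```shell
# Hypothetical helper: print "<count><TAB><pattern>" for each pattern given.
# Assumes one match per output line (no -l option).
compare_patterns() {
    for pat in "$@"; do
        n=$(tgrep2 -af "$pat" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$pat"
    done
}

if command -v tgrep2 >/dev/null 2>&1; then
    # Flat:   the S has an NP child and a DT child.
    # Nested: the S has an NP child, and that NP has a DT child.
    compare_patterns 'S < NP < DT' 'S < (NP < DT)'
else
    echo "tgrep2 is not on your PATH" >&2
fi
```

If the two counts differ, the parentheses changed which node the second relation applies to.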

Formatting your output. By default tgrep2 returns the match for the left-most element in the search pattern. So, if you want the NP in the search above to be the output, you can regroup the pattern:

    NP > (S < VP)
This matches all NPs that are immediately dominated by an S that immediately dominates a VP.

There is a manual for TGrep2, which can also be found in

    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

See also the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags. For more information on the tags and bracketing conventions used in Switchboard, try: The Penn Treebank: An overview (voted "most useful" by Florian Jaeger), and Bracketing Switchboard: An addendum to the Treebank II Bracketing Guidelines. There is also useful documentation on the Penn Part-Of-Speech Tagging conventions, Disfluency Annotations, and Predicate-Argument Structure annotations. Some of this information applies both to TGrep and to TGrep2.

Setting up TGrep

By Susanne Riehemann (ed. by Florian Jaeger): To use TGrep you need to be logged in to an epic computer - this doesn't work on the elaines! Use Samson or any ssh software to connect to one of the epics, i.e. epic1.stanford.edu through epic28.stanford.edu

Assuming you use csh or tcsh, you need to do the following to set everything up properly. (You can do these at the prompt, but if you want to avoid having to do this every time you use TGrep, put them at the end of the .login file in your home directory. The first time you do this you'll need to log in again.):

    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/swbd_mrg.crp
    setenv PATH /afs/ir/data/linguistic-data/bin/sun4x_57:$PATH
    setenv MANPATH /afs/ir/data/linguistic-data/man:$MANPATH

Then, you should be able to do:

    tgrep 'NP < VP'

which finds NPs which immediately dominate VPs, and have it work! You can find usage instructions in 'man tgrepdoc', while 'man tgrep' tells you about the command flags. See also the notes below on the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags.

This sets things up for Switchboard (merged). On AFS there are now 4 TGrep indices, covering the parsed sections of Switchboard, the WSJ (2 versions: one with PoS tags, one without), and Brown (only a small fragment was treebanked). You can change the value of TGREP_CORPUS above to one of the following, or specify a corpus on the command line.

    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/swbd_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/brown_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_skel.crp

Note that we now have TGrep2 as well!

The TGrep syntax and some examples

Searching for NPs that immediately dominate VPs:
    tgrep 'NP < VP'

If you want to look at this output as it is generated, use:

    tgrep 'NP < VP' | more

If you want to save the output to a file, use:

    tgrep 'NP < VP' > filename

Some useful command-line options

If you want to see only the terminal nodes of the tree, use:
    tgrep -t 'NP < VP'

If you want to see the tree for the whole sentence in which the match occurred, use:

    tgrep -w 'NP < VP'

If there are multiple matches for the pattern in a sentence, you can find them all with:

    tgrep -a 'NP < VP'

These switches can be combined, e.g. if you want to see only the words of the whole sentence in which the match occurred, use:

    tgrep -tw 'NP < VP'

Some useful operators

A < B      A immediately dominates B
A << B     A dominates B
A <- B     B is the last child of A
A <<, B    B is a leftmost descendant of A
A <<` B    B is a rightmost descendant of A
A . B      A immediately precedes B
A .. B     A precedes B
A $ B      A and B are sisters
A $. B     A and B are sisters and A immediately precedes B
A $.. B    A and B are sisters and A precedes B

Some examples using these operators: To search for NPs that are coordinations of plural nouns:

    tgrep -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'

If you've done any interesting TGrep searches for your research, please send the commands to (Corpus TA), so other people can learn by example.

Differences between TGrep2 and Tgrep

The manual for TGrep2, locally stored at:

    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf
also contains a list of new features and changes with regard to the first version of TGrep in section 6 (pp. 16-17). The main difference is that TGrep2 allows reference to edge labels (and, as far as I can see, secondary edge labels). Those labels are usually used to mark up additional information about a phrase, including:
  • That a to-PP is dative, or locative, etc.
  • The grammatical function of a constituent
  • X-bar tags (e.g. headedness)
  • expletive 'es' (for German)
  • etc.

From that it should be clear that this is a considerable improvement over TGrep. Another tool that allows searches that refer to any kind of edge labels is TIGERsearch. TGrep2 has also been improved in terms of the control it gives you over the form of the output, the speed of searches, and in that it now accepts search patterns from files as input rather than only patterns typed on the command line.

Note that some features of the old TGrep are no longer supported in TGrep2 (for a list, see the manual).