An introduction to the Stuttgart Corpus Workbench, CQP, & XKWIC

By Susanne Riehemann (ed. by Florian Jaeger): The Stuttgart Corpus Workbench is a collection of powerful tools for searching prepared corpora, sorting the results, creating frequency lists, etc. This page gives an introduction to the Stuttgart Corpus Workbench, summarizes what you have to do to use the Workbench, provides some sample queries, and points out some known problems.

Tip-1: See also the Stuttgart Corpus Workbench Web Page and be sure to check out Jeanette Pettibone's CQP Manual (one of the reasons why the current page is not really necessary ;-).

Tip-2: Be sure to check out this nice introductory case study by Roger Levy (PDF file; ~115KB). In only 1.5 pages you get a pretty good idea of the potential of CQP, and along the way you even get an introduction to some of the relevant syntax.

Tip-3: You may also find the CQP Demos very useful.

Tip-4: The following site offers a comprehensive comparison of XKwic/CQP vs. WordSmith.

Note: This page assumes that you are logged in on turing since the Workbench does not yet work on the Leland system.

Getting started

Before you can use the Stuttgart Corpus Workbench tools with the North American News corpus for the first time, you'll have to add the following three lines to your .cshrc on turing (although it seems that the IMS Workbench is also installed on AFS now):
    setenv CORPUS_REGISTRY /turing/local/corpora/IMS-Corpus-Toolbox/registry
    setenv LLQUERY_LOCAL_CORP_DIR somedirectory
    setenv CQP_LOCAL_CORP_DIR somedirectory

You can choose any somedirectory as long as there is enough space, e.g. make a subdirectory in /tmp ("mkdir /tmp/myusername"). Then add the following path to your PATH variable ('echo $PATH' shows you what is currently included in your PATH):

    /turing/local/corpora/IMS-Corpus-Toolbox/bin

You can do this by typing the following (which prepends the above-mentioned path to your PATH variable):

    setenv PATH /turing/local/corpora/IMS-Corpus-Toolbox/bin:$PATH
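
Putting the pieces together, the additions to your .cshrc might look like this (a sketch, assuming you created /tmp/myusername as your local corpus directory):

    # IMS Corpus Workbench setup on turing
    setenv CORPUS_REGISTRY /turing/local/corpora/IMS-Corpus-Toolbox/registry
    setenv LLQUERY_LOCAL_CORP_DIR /tmp/myusername
    setenv CQP_LOCAL_CORP_DIR /tmp/myusername
    setenv PATH /turing/local/corpora/IMS-Corpus-Toolbox/bin:$PATH

Remember to create the directory first ("mkdir /tmp/myusername") and to source your .cshrc (or log in again) before starting CQP or XKWIC.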



Two interfaces to the Stuttgart Corpus Workbench

There are two ways of accessing the data:

  • with the command-line CQP interface (type CQP on turing)

    Documentation:
    /turing/local/corpora/IMS-Corpus-Toolbox/cwb-960825/doc/CQPman.ps

  • with the graphical XKWIC interface (type XKWIC & on turing)

    Documentation:
    /turing/local/corpora/IMS-Corpus-Toolbox/cwb-960825/doc/XKWIC-man.ps

All the commands you can type at the CQP command line should also work in the XKWIC "query input" box (you'll have to click on "start query" instead of hitting return), although for some things (like selecting a corpus) XKWIC also offers a simpler way. Some things can only be done in XKWIC, such as clicking on a line in the result to see more context in a separate box, sorting the result, or selecting a previous query from the query history.




Selecting an available Corpus

To select a corpus, you type the name of the corpus followed by a semicolon (see the example below). Corpora have to be specially prepared and indexed in order to be used with the Stuttgart Corpus Workbench. The corpora available for use with the Corpus Workbench are (please be aware that this information may be outdated; if you have questions, please ask the Corpus TA):
  • NYT - New York Times News Syndicate 7/94-12/96
  • WSJ - Wall Street Journal 7/94-12/96
  • LATWP - Los Angeles Times & Washington Post 5/94-8/97
  • REUTE - Reuters News Service General 4/94-12/96
  • REUFF - Reuters News Service Financial 7/94-12/96

These correspond to the plain text files in

    /turing/local/corpora/North-American-News
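
For example, to work with the New York Times data you type the following at the CQP prompt (or in the XKWIC query input box); the corpus names are the ones listed above:

    NYT;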



Formulating a query

Before you start querying, you need to decide how much context you want to have available, e.g. if you want one sentence before and one sentence after the sentence containing the match:

    set context 2 s;

Some example queries to get familiar with the syntax are given below:

    "Clinton";
returns all occurrences of that word
    "give[sn]?|gave";
returns all occurrences of "give", "gives", "given", and "gave" (see below for how to make this more efficient with high-frequency words)
    "give" "up";
returns all occurrences of that phrase. (see below for how to make this more efficient with high-frequency words)
    "give" []* "up" within s;
returns all occurrences of "give" and "up" within the same sentence but separated by any number of words (see below for how to make this more efficient with high-frequency words)
    MU (meet "give" "up");
produces the same result as "give" "up"; but much more efficiently
    MU (meet "pull" "strings" -5 7);
returns all occurrences of "strings" within 5 positions to the left or 7 positions to the right of "pull")
    MU (meet "pull" "strings" s);

returns all occurrences of "pull" and "strings" in the same sentence




Creating subcorpora to speed up searches

If you want to look for a sequence of frequent words like "There|there" "is" "every" you can start by doing an efficient meet query (see "Formulating a query") on parts of it and then run the slower query just on the result:

    MU (meet "is" "every") expand to s;
    isevery=Last;
    isevery;
    "There|there" "is" "every";

There are many other things you can do with "subcorpora" - see the documentation.
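
For example, the same trick works for the "give" []* "up" query from the "Formulating a query" section; a sketch ("giveup" is just an arbitrary name for the subcorpus):

    MU (meet "give" "up" s) expand to s;
    giveup=Last;
    giveup;
    "give" []* "up" within s;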




Saving the Result

In XKWIC there is a menu option for saving the results of your query. In CQP, you use:

    cat Last > "filename";

If you want to save all the sentences matching your query into a file without any of the context around them, you need to set the context to one sentence and expand the match to one sentence:

    set context 1 s;
    MU (meet "pull" "strings") expand to s;
    cat Last > "filename";

This only seems to work in CQP, not in XKWIC (obviously, once you "expand to s" you no longer have a KWIC format).
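
If you prefer a more descriptive filename, you can combine this with a named query result (see "Creating subcorpora" above); a sketch, with "pull-strings.txt" as a made-up example filename:

    set context 1 s;
    MU (meet "pull" "strings") expand to s;
    pullstrings=Last;
    cat pullstrings > "pull-strings.txt";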




XKWIC: Sorting the result

If you are using XKWIC you have the option of sorting your result in various ways, e.g. in this case by the first word following "there is every". Because the first word in the match ("there") is position 0, the position you're interested in is 3, so select first sort column=3, last sort column=4. To sort by the word before the match use first sort column=-1, last sort column=-2.
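
For example, writing out the token positions for a match like "there is every reason ..." (the continuation is made up) makes the numbering clear:

    there   is   every   reason   to   ...
      0      1     2        3      4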

For more details, see the documentation.




CQP: Using Macros

If you are not a graphical interface person or you need to run many similar, complex queries, you might like the ability to run the system from the command line with macro files. For example, you can put the following into a macro file called "give-up-macro":
    NYT;
    set context 1 s;
    MU (meet (union (union (union (union "give" "gives") "gave") "given") "giving") "up" 1 2);

and then use the command:

    CQP -f give-up-macro >& /tmp/give-up &

This will return all occurrences of the forms of "give" ("give", "gives", "gave", "given", "giving") followed by "up" with at most one other word intervening. For more examples, look at the documentation.
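
You can also save the matching sentences directly from a macro file by including the commands from "Saving the Result" above; a sketch (the output filename is just an example):

    NYT;
    set context 1 s;
    MU (meet "pull" "strings") expand to s;
    cat Last > "pull-strings-sentences";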

If you are interested in the frequency of words with a particular characteristic such as ending in "ive", you can type something like:

    lexdecode -f -p ".*ive" nyt | sort -rn > /tmp/nyt-ive &

See also the Stuttgart Corpus Workbench Web Page.




Known problems

There are a lot of duplicate examples. In some cases this is because press releases are repeated in slightly varied form and/or in different papers, but I don't know whether that accounts for all of the duplicates. In any case, I don't know whether these duplicates can be deleted automatically with these tools - I've always used unix sort & uniq for this purpose. (There is a button for manually deleting all selected sentences in XKWIC, which may be an option for a small search result.)
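
For example, after saving a query result to a file (see "Saving the Result" above), a shell pipeline along these lines removes exact duplicate lines (a sketch; /tmp/give-up is the output file from the macro example above):

    sort /tmp/give-up | uniq > /tmp/give-up.dedup

Note that this only catches literally identical lines, not the slightly varied repetitions mentioned above.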

When I inserted the sentence boundary markers I missed some of the SGML tags, so you get junk appended to some of your sentences even with the context set to "1 s". Also, my sentence boundary detection from punctuation marks & a list of abbreviations isn't perfect, so there are some incomplete sentences and some missed sentence boundaries. Since it took me weeks to install the corpus (mostly because turing had neither enough memory nor a large enough partition), I'm not going to fix this.