The TextRunner system you read about in Banko et al. 2007 later evolved into the ReVerb system, described here:
Fader, Anthony; Stephen Soderland; and Oren Etzioni. 2011. Identifying Relations for Open Information Extraction.
For our present purposes, the differences between TextRunner and ReVerb are not important. Broadly speaking, ReVerb uses the same approach to unsupervised relation extraction as TextRunner, with some additional refinements that improve the precision and coverage of the system.
There is a demo of the ReVerb system online at:
http://openie.cs.washington.edu/
To get warmed up, spend a few minutes playing with the demo. Try using it to answer the following relational queries:
Try a few more queries of your own invention. Fun, huh? OK, now you're ready to answer a couple of easy questions:
A crucial ingredient of the Yao et al. 2012 paper you read for today is the clustering of path senses into semantic relations. In this exercise, we'll take a closer look at the clustering evaluation metrics used by Yao et al. to measure their success: the pairwise metrics and the B3 metrics. Clustering tasks crop up in many different areas of NLP and NLU (including documentation classification, coreference resolution, topic modeling, and automatic thesaurus construction), so it's useful to know something about how clustering models are evaluated.
Assume that we have a set of n items to be clustered, and two clusterings (that is, partitions) of the items: a predicted clustering which is to be evaluated, and a target clustering against which to perform the evaluation. Both the pairwise approach and the B3 approach reformulate the clustering task as a binary classification task (or a set of such tasks). By comparing the binary labels derived from the predicted clustering to the binary labels derived from the target clustering, we can then compute precision and recall, just as in a standard evaluation of a binary classifier. However, the two approaches differ in how they define this reformulation.
The pairwise approach reformulates the clustering task as a binary classification task over pairs of items. Given a clustering, we label each of the n(n − 1)/2 pairs of items true if the two items are in the same cluster, and false otherwise. We then compute precision and recall by comparing the pairwise labels derived from the predicted clustering to those derived from the target clustering.
The B3 approach reformulates the clustering task as a set of binary classification tasks, each focused around a single item. Given a clustering and a focus item x, we label every item (including x itself) true if it appears in the same cluster as x, and false otherwise. We then compute per-item precision and recall by comparing the labels derived from the predicted clustering to those derived from the target clustering. Finally, we compute aggregate precision and recall by averaging per-item precision and recall over all items.
Lastly, in both approaches we compute F1 as the harmonic mean of precision and recall.
Suppose we have a set of five items to cluster: {koala, ostrich, porpoise, salmon, tuna}.
The target clustering is {{koala, porpoise}, {ostrich}, {salmon, tuna}}. (In other words, we want to cluster the critters by taxonomic class.)
We'll evaluate three possible clusterings against this target clustering:
We'll evaluate each clustering using both the pairwise metrics and the B3 metrics.
| pairwise metrics | |||
| clustering | precision | recall | F1 |
| naïve | |||
| minimal | |||
| maximal | |||
| B3 metrics | |||
| clustering | precision | recall | F1 |
| naïve | |||
| minimal | |||
| maximal | |||
If you're struggling with the definitions of the evaluation metrics, it might be helpful to do some Googling, or even go back to the original sources. The earliest source I know of for the pairwise approach is:
Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1993. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning.
If someone knows of an earlier source, please let me know!
The B3 evaluation metrics originated in:
Bagga, Amit and Breck Baldwin. 1998. Algorithms for scoring coreference chains.
This page may also be helpful in trying to grok the B3 metrics.
In case you're unfamiliar with the concepts of precision, recall, and F1, here's a quick explanation.
To evaluate a binary classifier, we compare the predicted labels to the target labels for every item in our test set. We begin by compiling counts, over the test set, of each of the four possible outcomes:
We then compute precision and recall from the counts of these outcomes.
Finally, F1 is defined as the harmonic mean of precision and recall:
For example, suppose we have a test set of 10 items, with the following predicted and target labels:
| item | predicted | target | outcome |
|---|---|---|---|
| 1 | T | F | FP |
| 2 | T | T | TP |
| 3 | F | F | TN |
| 4 | T | T | TP |
| 5 | F | T | FN |
| 6 | F | F | TN |
| 7 | T | T | TP |
| 8 | T | F | FP |
| 9 | F | F | TN |
| 10 | T | T | TP |
The outcome counts are:
| outcome | count |
|---|---|
| TP | 4 |
| FP | 2 |
| FN | 1 |
| TN | 3 |
The precision, recall, and F1 are therefore:
More background can be found in the Wikipedia articles on precision and recall and F1.