This page contains information about the design of the class project. Please direct all questions about this information to the TA for the class, Teg Grenager.
The inspiration for this quarter's class project comes particularly from the CiteSeer Scientific Literature Library at the NEC Research Institute. CiteSeer is a publicly available digital library which indexes the academic research papers that are available on the World Wide Web. However, there are also other interesting online academic literature repositories, such as the local Highwire Press, an offshoot of the Stanford libraries. Earlier predecessors which implemented similar kinds of functionality were Andrew Ng's ML Papers, and Cora.
While CiteSeer is a very useful research tool, we believe that it leaves a lot of opportunities for improvement, and one goal of the class project is to collectively produce a system which performs better than CiteSeer, in at least some respects. For lack of a better name, we will start off by calling our new and improved system CiteUnseen. We're looking for suggestions for a better name from you!
The other, and more important, goal of this project is that this problem provides an application wherein one can explore most of the important technologies discussed in the course: text clustering, classification, information extraction, link-based analysis, collaborative filtering, and various forms of text mining. We hope that the project can be a good instance of problem-based learning: you will get to acquire skills and see their importance in a real context. We will attempt to provide appropriate resources and pointers about what you should be doing, but an important part of this is that you should ask questions if there are things that you feel that you need to know more about. We can hopefully at least point you in useful directions.
Because the scope of the project is relatively large, we do not expect small groups of students to implement their own version of CiteUnseen. Instead, we have developed APIs and data structures that (hopefully) will allow individual teams to work on separate subproblems, while enforcing overall interoperability. Each small team will have the opportunity to work on more than one project over the course of the quarter, and will do a combination of vital infrastructure and research into a problem of interest. We're going to write the system in Java. We've tried to outline components and database structures for the system below. However, we're not perfect (and haven't actually built a complete system before the class began, though we did play with components), and we may well have missed some things. So, if, when you start to build things, some things seem missing, or something seems wrong in our suggested organization or approach, do let us know. You might well be right.
When complete, CiteUnseen should provide users with the following functionality (and perhaps some other good ideas we come up with):
CiteSeer has been a very successful and widely used tool that is valued by academic researchers. Apparently crucial to its success has been the fact that it finds and indexes content (like a web search engine crawler), rather than requiring researchers to deposit materials in an institutional or other online paper server. Beyond that, it offers facilities such as locating citations of papers and measurements of the impact of authors which were previously only available through laborious library research (using traditional citation services like Science Citation Index). On the other hand, there are many shortcomings and opportunities to improve on the kind of service it provides. Problems include (in no particular order):
m jordan or michael jordan or m i jordan or michael i jordan, which even when done carefully cause you to lose on either precision (there are several
m jordan's or recall (if you omit this form).
The project will be divided into two major parts, corresponding to the first and the second half of the course. In the first part we will focus on developing the core infrastructure required to have a functioning system. We would like people to work in pairs on subprojects relevant to this goal. This part will be divided into two stages: 1A and 1B.
In stage 1A we will assign a specific system component to each group of students (we will try to take student preferences into account, but cannot guarantee that all students will get their most preferred choice). Students will have approximately 2 weeks to complete the projects in stage 1A. After all the groups submit their code for 1A, we will attempt to run the system as a whole, noting where it is broken or has performance problems.
In stage 1B we will assign new projects to the student groups, with the objective of adding functionality that was missing from 1A, as well as fixing problems and performance issues of 1A. There will be a week and a half to complete projects for 1B. At the conclusion of 1B we hope to have a functioning system that implements the main components of a paper and citation indexing and retrieval system. We would like to work with you to help achieve it. If there are things that you don't know how to do, please fell free to contact the TA and professors for any ideas they may have.
In part 2 of the project, you will have a chance to pick your own topic for study and development. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1. You will have five weeks to complete part 2, and at the end will present their projects (perhaps to a celebrity audience!). This is your opportunity to do a more detailed piece of research, which should be focussed on research results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. You will need to submit a write-up describing your experiments and findings, as a research paper.
We do want projects in part 2 be interoperable with the existing citation index system, and to facilitate this, we will have a checkpoint for integrating an early version of your new work, so that we have some time to evaluate and solve any interoperability concerns. We will also be pleased if people implement any miscellaneous bug fixes or performance enhancements to the general system that come up in the course of their project. We hope that at the end of the course we will have a working system that includes innovative new ideas for paper and citation finding, browsing, etc.
Throughout the project, we expect students to work in pairs (if there is an odd number of students, we will make accomodations). Students may change their groups between part 1 and part 2 of the project so as to find other partners who share their research interest. However, you will need to negotiate with other groups to come up with some rearrangement of people so one person isn't left stranded and unhappy.
In part 1, some projects require more work in IR/IE/NLP, and others require more work in systems and software engineering. Because both types of effort are critical to the success of this project, we will value equally these two types of contributions. For part 2, your project should definitely aim to pick up on some of the IR/IE/NLP topics discussed in the course.
The CiteUnseen application begins with virtually no internal data, and gradually builds up data structures that allow it to provide the functionality described above. In order to make the subproblems modular, we have designed a workflow of several stages, with an associated sequence of data structures, that we want programs to follow. The data structures have been designed to store more data than strictly necessary, to allow some flexibility in the actual algorithms used in the implementation. We define the workflow as follows:
|Objective||Project(s)||Input Data||Output Data||Issues/Challenges||Team|
Omar Seyalo: firstname.lastname@example.org
Steve Branson: email@example.com
||Anton Ushakov: firstname.lastname@example.org|
Yang Huang: email@example.com
Fang Wei: firstname.lastname@example.org
Haoyi Want: email@example.com
Zhen Yin: firstname.lastname@example.org
Joseph Smarr: email@example.com
Tim Grow: firstname.lastname@example.org
Qi Su: email@example.com
Steve Ngai: firstname.lastname@example.org
Unfortunately, it seems to be necessary to use three different types of data structures, all of them residing on disk: files, relational database tables, and an inverted index. We discuss each in turn below:
These should be pretty self-explanatory. We choose to store both the postscript and the text file of the paper so that we can redo the conversion if we obtain better algorithms. All files are stored in a formal directory structure so that they can be found deterministically.
These live in the base directory /afs/ir/class/cs276b/data/webPages. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH. Some of these web pages are also hub pages, and this is indicated not by the file names but by the records in the database.
These live in the base directory /afs/ir/class/cs276b/data/rawPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.
These live in the base directory /afs/ir/class/cs276b/data/textPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.
The first five tables are used to store the raw data taken from the documents themselves, before removing duplicates.
PageInstance(id, url, filename, status, score, isHub)
PaperInstance(id, url, rawFilename, textFilename, status, author, title, abstract, citations, selfCitation, citationInstanceID, authorBegin, authorEnd, titleBegin, titleEnd, abstractBegin, abstractEnd, citationsBegin, citationsEnd, paperID)
CitationInstance(id, fromPaperInstanceID, fromHubInstanceID, toPaperInstanceID, citationText, citationTag, author, title, date, publication, volume, pages, editor, publisher, citationID, paperID, status)
CitationContextInstance(citationInstanceID, paperInstanceID, contextBegin, contextEnd, context)
AuthorInstance(id, authorText, first, middle, last, suffix, citationInstanceID, paperInstanceID, authorID)
The latter seven tables are the final data structure from which duplicates have been removed and which is used to respond to user queries.
Paper(id, citationInstanceID, paperInstanceID)
Author(id, first, middle, last, suffix, email, affiliation)
Name(id, altName, isCanonical)
Citation(fromPaperID, toPaperID, citationInstanceID)
We will use MySQL as our database management system. It manages the tables, and queries over the tables.
We will build an inverted index over the text versions of the papers, where the papers have been divided up into fields including author, title, abstract, introduction, and references. We will use Lucene, an open-source Java-based indexing system developed by Apache to build and query the inverted index.
Part 2 of the project is your opportunity to do a more detailed piece of research, which should be focussed on novel research, interesting results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. For part 2, results of your experiments should be submitted as a (maximum 8 page) research paper, as in a conference proceedings. Any topic related to the course and the project is fine. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1.
There are many possibilities for the second part of the project. We list a few here, but our experience shows that students can often be much more creative than we are.
Our general aim is to build a well-engineered and scalable enough
system that we can download, and do information extraction,
retrieval, text clustering, etc. on a database of the order
of a million research papers. This means that algorithms
and methods will have to be chosen so that they scale
sensible (e.g., an n3 clustering algorithm
would not be a good choice). We're going to keep our CVS
repository on the Leland systems, and in Part 1A of the project
will develop and run code there. You should check code into
CVS as soon as possible (as soon as it compiles...), since
this will make it easier for other people to see what you're
doing, and how things might work together. For reasons of disk
storage alone, all tests at this stage will have to be
small. We then plan to deploy and test the system on a
Linux machine with plenty of disk space. (And so you should
do anything that you think will make the system hard to port
A. Paper Web Crawler
A central requirement for this project is an efficient and robust web crawler which can initially find and download what we are calling "hub pages" -- here pages that contain one or more research papers linked off them. There are several important issues:
http://www.robotstxt.org/. Your crawl must observe the
robots.txtprotocol, and only download things that the webserver owner wants robots to download. The robot should supply contact information. You must also limit the amount of stuff that you download from one site over a short period of time. This requires the project to use a round robin mode where it bounces around a list of sites it is downloading stuff from, rather than pulling everything off one site first in a "depth-first" manner. Finally, there must also be opportunities to control the overall rate of data downloading, so that Stanford doesn't complain at us either. For example, one should be able to throttle back the downloading rate to 1 megabyte a second or whatever.
http://citeseer.com). Some other well-known paper repositories, such as
http://www.arxiv.org/also disallow robot crawling (well, to be more precise, they allow Google some places but noone else...). In general, aiming to get papers from individual researcher's home pages seems the best way to get the widest and most up-to-date selection of papers.
The first aim of the robot is to find and download HTML pages that contain citation information on academic works, including in particular "hub" pages which contain links to papers (in Postscript, PDF, or other formats). The central research challenge is to effectively find appropriate pages. One needs to start somewhere, and we imagine beginning with a small seed file, which could contains some suggested search engine queries and starting page URLs. For example, such a file might have:
Query: conference workshop papers pdf technical report|
A central research question is how to do intelligent focussed crawling (or resource discovery) rather than simply blindly downloading linked HTML pages. A good performance metric might be the ratio of useful pages (ones with citation information, including hubs) to all pages downloaded. See the paper by Chakrabarti et al. on focused crawling.
The robot might want to adopt a variety of specialized crawling methods. For example, descending through the faculty and student pages of departments or universities, looking for papers on faculty and student pages or on a publications page linked off of them seems a good strategy. Another good strategy for resource discovery seems to be to take a paper title that you know, send it to a search engine, and to find other sites which store that paper. They may in turn have many other papers stored at them. There can thus be a feedback look between later stages of project processing (getting paper titles) feeding back into new things for the crawler to look for. A central need is that the system be able to discover over time new sites where academic papers are located, rather than simply finding things from a static collection.
A second task of the robot is to download the actual papers. These
will be identified by another
group, but it is the crawler's job to download them
(while again observing constraints imposed by load
There has been a considerable amount of work on writing web crawlers, and in particular some previous work using Java (some of it old versions of Java...). However, we're not aware of a publically available Java crawler. This work has focussed on standard web page crawling applications. Most such work maintains a list (representing a kind of breadth-first search) or URLs it knows about but hasn't downloaded, and then multiple crawler agents select new URLs to download based on some metric (such as the estimated pagerank of the pages). There are opportunities here to use cleverer metrics. Some work to be aware of includes:
The first part of this project is to take putative hub pages (ones that are meant to contain research papers on them) identified by the crawler, and to do further processing on these HTML pages, and then subsequently on the actually downloaded pages.
"A"elements), but there is ample room for creativity and cleverness in finding how much context to extract as representing citation information about the paper. Commonly the anchor text (stuff inside the
"A"element) of a paper will just be the title, or maybe just
P, BR, LIas separators (but not others such as
I, B, ü), and one could also hope to sanity check the content of the proposed citation: it should not be too big or too small, and it should look like a paper citation. This could be evaluated by a simple sequence model, such as a n-gram language model trained on a featural decomposition of words (number, capitalized, etc.).
HTML, RTF, DOC, or other formats, and many other documents such as product manuals and advertising copy appear in
Many people have written tools for converting PS files to text. Most
of them are wrappers around the ghostscript
ps2ascii.ps, which is actually a Postscript
program that gets text out of another Postscript file.
(All Postscript files are essentially programs, written
in a stack-based language that resembles Forth, for
those of you that have run across that before.) A
couple of people have written
their own such postscript programs.
to essentially be able to use these tools.
We've installed under /afs/ir/class/cs276b/software various such tools.
(We've also spent a fair bit of time looking at them: you should come
talk to Teg or Chris for a bit of a brain-dump about
what we currently know, though there are some more
The main practical issue is whether to just use the plain text format
(which gives less information but is easier to look at)
or the richer representation which includes font change
information, etc. This should make subsequent
information extraction easier. We may want to use both
for different purposes. This should be
negotiated with the next group. It's probably best to
start with just plain text
For PDF files, there is similarly a
available on the Leland systems. It produces
very plain text files. We're not currently aware of a free program that
produces a text version of PDF files with some more font
and markup information. But it'd be nice if there were
one and you could find it!
More on what we know
You can find programs that we've looked at in either the
Most of them we have compiled and added some simple setup shell scripts
to. Extracting text from a PS program is a messy heuristic business,
and the frustrating thing is that all the programs seem to work better
for some bits of the problem and worse in other places. For the first
phase, we should probably stick to using one that seems best, or perhaps
doing a meta-chooser based on the output of several. A first task is to
evaluate which one seems most reliable in general. We've only done that
for a very few files. A few comparisons on a few files follows:
|Name||OK on 7.ps||OK on 8.ps||OK on tense.ps||OK on gi.ps||OK on gi.pdf||OK on alg.pdf||Ligatures||Notes|
|gs-8.0 ps2ascii||OK-ish||OK-ish||OK||OK||OK||OK||Yes (except 7)||Generally stable. Only 7 bit ASCII. Can decode all charsets. Doesn't put in line breaks as well as pstotext, e.g., on algthatlearns.pdf. Has modes where more detailed info about font changes etc. can be output. Would need decoding.|
|gs-6.0 ps2ascii||OK-ish||OK-ish||OK||OK||Yes||Yes||Yes (except 7)||version installed on leland. Seems same as 8.0.|
|gs-4.03-ang ps2ascii||No!||No!||Yes||Yes||Yes||Yes||Bad for TeX OT1; okay on Type 1||Andrew once upon a time did some work to put extra info on font changes and line breaks into text, but it would seem that one would need to port the useful parts to a more modern version of ghostscript for good coverage. He also made it abort processing after the first 3 pages, but this part could be turned off.|
|pstotext||Gibberish||OK-ish||OK||OK||Yes||Yes||Yes (TeX OT1, Type 1)||Seems fairly robust. Gives page breaks, but little else. Puts section headings and title lines on a line by themselves more robustly than gs8 (see tense.ps or algthatlearns.pdf).|
|prescript 0.1||No!||No!||Yes||Yes||No!||No!||Some (TeX OT1, not Type 1)||A bit above 2.2. Has html mode which puts in paragraph marks, but really no different to double blank lines in text mode.|
|prescript 2.2||No!||No!||No!||Yes||No!||No!||No||Doesn't seem to live up to the webpage hype for the few pages I tried. Has html mode which puts in paragraph marks, but really no different to double blank lines in text mode.|
|pdftotext||No!||No!||No!||No!||Yes||Yes||Yes||Installed on Leland machines. Only for pdf, but for those, it seems to do a slightly better job with headings, section titles etc. than anything else. Just plain text.|
(Other thinks I looked at (the Crete ps2html, old dmjones code, old JHU ps2html) doesn't seem worth exploring further.)
The goal here is to do what is sometimes called
zoning: working out the status of larger blocks
of text. The main blocks initially of interest are:
As a more detailed information extraction task, part of this project is to do actual information extraction of paper titles and authors.
There are various ways that one can approach this task. For identifying names, knowing names is very useful. However, there are many names, especially as the range of nationalities grows, and so one wants to be able to overall identify something that looks like a name versus things that look like paper titles. N-gram language model methods (and sensitivity to features like capitalization) can be very effective here.
A central need is getting accurate information about other papers cited in a particular paper. This is a reasonably structured information extraction task. The task is first a segmentation task of separating out individual citations (line break information is useful here, but not sufficient by itself), and then separating the citation text into fields of author, year, title, etc. Many of these fields are fairly obvious from their content, but sequence information is also an important indicator. This extracted information will then be put in the database for each citation. A later phase will try to collapse (usually variant) citations of the same work in different papers.
We'd like to be able to give snippets of text of the context in which
a work is cited in the text. This crucially involves
finding the key used to cite the citation in the
references, and then locating it in the text. The key
may be a number with symbols like
alphanumeric key, like
something constructed from author and year, like
Manning (1999) which may be cited in text
in several different ways:
Manning (1999) or
(Manning 1999) or
(Manning 1995, 1999).
A central problem in this domain is the of identifying instances of the same paper, person, or journal, despite the fact that they are described in variant ways. This problem is an instance of the data association problem, that of associating many different observations of an object to the object itself. In this case, the objects are papers, people, and journals, and the observations are the instances of the papers and citations that we find on the Internet.
This same problem turns up in many other domains where a lot of other places where one has to deal with messy, real-world observations including:
The task in this project is to create a database that contains unique papers, with associated authors and journals as first order objects. The starting point is the database produced by the preceding projects, which includes relations like PaperInstance and CitationInstance. Some of the specific challenges in this part of the project will be:
Rosenblatt F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.  Rosenblatt, F. (1962). Principles of Neurodynamics. Washington, DC:Spartanto appear in the same cluster.
Steve Lawrence (developer of CiteSeer) describes the approaches he tried in his paper on the subject, "Autonomous Citation Matching", located at http://www.neci.nec.com/~lawrence/pub-ri.html:
It is important to note that he is beginning from "unfielded" data, or citation "blobs" with no internal structure, while this group can count on the existence of fields that have been extracted by the other groups mentioned above.
This system needs a good UI to be effective ... and arguably this is one of the places where Citeseer is more lacking. Certainly, other sites, such as Hirewire, are trying to do rather more with the user interface, and there are a number of other ideas one might try. But, first things first, we need some UI. Indeed, we've adopted a slighly large definition of what is the front end: this includes the text indexing of papers. Key components are:
Most of the discussion of user interfaces and parametric search in cs276A is relevant.