Required installs
- Head to the NLTK download page, and download and install Python, NLTK, NumPy, Matplotlib, and the data distribution. You can skip Prover9, MaltParser, and MegaM if you like, but get all the other optional libraries. (If you prefer to get a complete installation, with all the options, feel free to come to office hours for help. Mac users will need to install the Developer Tools, i.e., the Xcode optional install on the OS disk.)
Note: We will use Python 2.6.*, not Python 3. You should be fine with any version in the range 2.5.x-2.6.x. Python 3 is a different beast. You are welcome to use it, but you might have to modify code and make other adjustments.
- Windows users should install Cygwin for the command-line utilities it provides. (Mac and Linux users have these utilities already.)
- Windows users might find the differences between Cygwin and the regular shell a drag. If so, they can do the work entirely within their account on the Stanford Unix machines. This page provides the details on what software to download and how to set it up for the Stanford machines. You'll want at least SecureFX and SecureCRT.
- Firefox is not only a great browser, it's also smart about character encoding and representing the underlying structure of Web documents.
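Once the installs above are done, a quick sanity check can confirm that Python can actually find NLTK, NumPy, and Matplotlib. The following is only a minimal sketch, not an official NLTK tool: the helper names `installed_version` and `report` are made up for illustration, and it assumes each library exposes a `__version__` attribute (it prints "unknown" when one does not). It is written to run under both Python 2.6 and Python 3.

```python
import sys


def installed_version(module_name):
    """Return the module's version string, "unknown" if it does not
    define __version__, or None if the module cannot be imported.
    (Hypothetical helper, named here for illustration only.)"""
    try:
        module = __import__(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def report(module_names):
    """Map each module name to the result of installed_version()."""
    return dict((name, installed_version(name)) for name in module_names)


if __name__ == "__main__":
    # Show which interpreter is running, since the course assumes 2.5.x-2.6.x.
    print("Python " + sys.version.split()[0])
    for name, version in sorted(report(["nltk", "numpy", "matplotlib"]).items()):
        print("%-12s %s" % (name, version if version else "NOT FOUND"))
```

Save this under any file name you like and run it with the same `python` command you plan to use for the course. A "NOT FOUND" line means that interpreter cannot import the library, which usually points to a missing install or to a second Python on the machine.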
Other data and resources
Tips for new stuff are most welcome!
- Directory containing the readings for this course (and others); password protected
- Stanford NLP's resources list
- Stanford Linguistics' Corpus Resources
- Finite-state morphology (book and tools)
- Linguistic Data Consortium
- Andrew McCallum's data page
- NLTK data distribution
- 20_newsgroups
- English Gigaword sample (LDC)
- Switchboard Dialog Act Corpus
- CHILDES database
- Lillian Lee's Sentiment analysis corpora
- UMass Amherst Sentiment Corpora
- Project Gutenberg (novels, poems, plays)
- Enron email collection (Corrada-Emmanuel's useful tools)


