Most of our corpora are provided by the Linguistic Data Consortium (LDC), but there are also several non-LDC corpora. The corpora are available on CD or DVD, and some are also available online on AFS. We are moving more corpora online, and this page will be updated with location details as new datasets are added.

LDC Corpora

If a corpus is stored on AFS, the table below shows its directory under /afs/ir/data/linguistic-data/. For example, the English Gigaword is stored at /afs/ir/data/linguistic-data/EnglishGigaword. If no AFS location is given, the corpus is not available online and you must borrow the disc to use it. A limited subset of (older) corpora is available for download from LDC Online, so if you cannot find a corpus on disc or on AFS, check with the corpus TA. See Get Access for details on how to register to use corpora.
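As a minimal sketch of the path rule described above, the following Python snippet joins a value from the AFS column onto the root directory given on this page. The function name `corpus_path` and the constant `AFS_ROOT` are illustrative only and are not part of any tool provided here.

```python
import posixpath

# Root directory stated on this page; the names below are hypothetical helpers.
AFS_ROOT = "/afs/ir/data/linguistic-data"


def corpus_path(afs_entry):
    """Return the full AFS path for a value from the AFS column.

    Returns None when the column is empty, which on this page means the
    corpus is not online and has to be borrowed on disc.
    """
    entry = afs_entry.strip()
    if not entry:
        return None
    # Some entries contain a subdirectory, e.g. "TDT/TDT2-Multilingual".
    return posixpath.join(AFS_ROOT, entry)


if __name__ == "__main__":
    print(corpus_path("EnglishGigaword"))
    print(corpus_path("TDT/TDT2-Multilingual"))
```

Whether the resulting directory is readable still depends on your having registered for access, so a check such as `os.path.isdir` on the returned path may be a sensible next step.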

ID | Name | AFS
LDC2009S05 | 2007 NIST Language Recognition Evaluation Supplemental Training Set | LanguageRecognitionTraining
LDC2009T25 | Web 1T 5-gram, 10 European Languages | Web1T5gramEuropean
LDC2009T26 | NXT Switchboard Annotations |
LDC2009T24 | OntoNotes Release 3.0 |
LDC2009T28 | French Gigaword Second Edition |
LDC2009T30 | Arabic Gigaword Fourth Edition |
LDC2009T29 | ACL Anthology Reference Corpus |
LDC2009T12 | 2008 CoNLL Shared Task Data |
LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data |
LDC2009L01 | An English Dictionary of the Tamil Verb Second Edition |
LDC2007A09 | Arab Armored Forces Egyptian Dialect Edition |
LDC2009T22 | Arabic Treebank English Translation |
LDC2009E72 | Arabic Treebank Part 5 V1.0 |
LDC2009V01 | Audiovisual Database of Spoken American English |
LDC2009T04 | BioProp 1.0 |
LDC2009E39 | CoNLL 2009 Shared Task Chinese Development Set |
LDC2009E39A | CoNLL 2009 Shared Task Chinese Development Set |
LDC2009E39B | CoNLL 2009 Shared Task Chinese Development Set |
LDC2009E37 | CoNLL 2009 Shared Task Chinese Test Set |
LDC2009E38 | CoNLL 2009 Shared Task Chinese Training Set |
LDC2009E36 | CoNLL 2009 Shared Task Chinese Trial Data Set |
LDC2009E36A | CoNLL 2009 Shared Task Chinese Trial Data Set |
LDC2009E36D | CoNLL 2009 Shared Task Chinese Trial Data Set |
LDC2009E35 | CoNLL 2009 Shared Task Czech Development Set |
LDC2009E35B | CoNLL 2009 Shared Task Czech Development Set |
LDC2009E33 | CoNLL 2009 Shared Task Czech Test Data Set |
LDC2009E34 | CoNLL 2009 Shared Task Czech Training Set |
LDC2009E34A | CoNLL 2009 Shared Task Czech Training Set |
LDC2009E32 | CoNLL 2009 Shared Task Czech Trial Set |
LDC2009E32A | CoNLL 2009 Shared Task Czech Trial Set |
LDC2009E31 | CoNLL 2009 Shared Task English Development Set |
LDC2009E29 | CoNLL 2009 Shared Task English Test Data Set |
LDC2009E30 | CoNLL 2009 Shared Task English Training Set |
LDC2009E30A | CoNLL 2009 Shared Task English Training Set |
LDC2009E28A | CoNLL 2009 Shared Task English Trial Set |
LDC2009S01 | CSLU: Numbers Version 1.3 |
LDC2009T20 | Czech Broadcast Conversation MDE Transcripts |
LDC2009S02 | Czech Broadcast Conversation Speech |
LDC2009T01 | English CTS Treebank with Structural Metadata |
LDC2009T13 | English Gigaword Fourth Edition |
LDC2009R30 | Fisher Spanish Speech and Transcripts |
LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 |
LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 |
LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 |
LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 |
LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 |
LDC2009R54 | Greybeard Eval |
LDC2009T08 | Japanese Web N-gram Version 1 |
LDC2009T10 | Language Understanding Annotation Corpus |
LDC2009E44 | LDC Standard Arabic Morphological Analyzer (SAMA) version 3.0 |
LDC2009A01 | NIST 2008 Speaker Recognition Evaluation Followup |
LDC2009E42 | NIST LRE 2009 CTS Training Data, Indian English Development Data |
LDC2009P01 | North American News Corpus, WSJ Subset |
LDC2009T11 | REFLEX Entity Translation Training/DevTest |
LDC2009T21 | Spanish Gigaword Second Edition |
LDC2009E73 | Standard Arabic Morphological Analyzer (SAMA) Version 3.1 |
LDC2009T14 | Tagged Chinese Gigaword Version 2.0 |
LDC2009T07 | Unified Linguistic Annotation Text Collection |
LDC2009R27 | VOA Dari & Pashto Audio Archive |
LDC2008A02 | 2004 An Nahar News Archive |
LDC2007A25 | 2004 Nove Broadcast Video |
LDC2008A03 | 2005 An Nahar News Archives |
LDC2008A01 | 2006 An Nahar Archives |
LDC2008A08 | 2007 Al Hayat Newswire Archives |
LDC2008E30 | Aquaint Download |
LDC2008T25 | AQUAINT-2 Information-Retrieval Text Research Collection |
LDC2008S09 | CHAracterizing INdividual Speakers (CHAINS) |
LDC2008E32 | CoNLL 2008 Shared Task Training Set |
LDC2008T22 | Czech Academic Corpus 2.0 |
LDC2008E36 | Fisher Phanotics Calls |
LDC2008E38 | GALE Phase 3 Release 2 - Broadcast Audio |
LDC2008E41 | GALE Phase 3 Release 2 - Web Text |
LDC2008L03 | Global Yoruba Lexical Database v. 1.0 |
LDC2008E29 | LCTL Bengali Language Pack 2.1 |
LDC2008E27 | LCTL Bengali v2.0 |
LDC2008S08 | LDC Spoken Language Sampler |
LDC2007A24 | NIST Pilot Meeting Corpus, Training Data |
LDC2008E01 | NTCIR-7 Advanced Cross-Lingual Information Access Task |
LDC2008T20 | PennBioIE CYP 1.0 |
LDC2008T21 | PennBioIE Oncology 1.0 |
LDC2008T19 | The New York Times Annotated Corpus | NYT-Annotated-Corpus
LDC2007A21 | TRECVID Nov 2004 |
LDC2008S05 | 2005 NIST Language Recognition Evaluation |
LDC2008T13 | BLLIP North American News Text, Complete |
LDC2008T14 | BLLIP North American News Text, General Release |
LDC2008T17 | CALLHOME Mandarin Chinese Transcripts - XML version |
LDC2008T24 | COMNOM v 1.0 |
LDC2008S06 | CSLU: Alphadigit Version 1.3 |
LDC2008S07 | CSLU: ISOLET Spoken Letter Database Version 1.3 |
LDC2008T09 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 |
LDC2008T08 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 |
LDC2008T18 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 |
LDC2008T23 | NomBank v 1.0 |
LDC2008T15 | North American News Text, Complete |
LDC2008T16 | North American News Text, General Release |
LDC2008T03 | ACE 2005 English SpatialML Annotations |
LDC2008L01 | An English Dictionary of the Tamil Verb |
LDC2008T07 | Chinese Proposition Bank 2.0 |
LDC2008S02 | CSLU: National Cellular Telephone Speech Release 2.3 |
LDC2008S01 | CSLU: Portland Cellular Telephone Speech Version 1.3 |
LDC2008T02 | GALE Phase 1 Arabic Blog Parallel Text |
LDC2008T06 | GALE Phase 1 Chinese Blog Parallel Text |
LDC2008L02 | Hindi WordNet |
LDC2008T01 | Hungarian-English Parallel Text, Version 1.0 |
LDC2008T04 | OntoNotes Release 2.0 |
LDC2008T05 | Penn Discourse Treebank Version 2.0 |
LDC2008S03 | STC-TIMIT 1.0 |
LDC2008S04 | West Point Brazilian Portuguese Speech |
LDC2007T22 | 2001 Topic Annotated Enron Email Data Set | Topic-Annotated-Enron-Email
LDC2007S10 | 2003 NIST Rich Transcription Evaluation Data |
LDC2007S12 | 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data |
LDC2007S11 | 2004 Spring NIST Rich Transcription (RT-04S) Development Data |
LDC2007T40 | Arabic Gigaword Third Edition |
LDC2007S03 | ARL Urdu Speech Database, Training Data |
LDC2007T38 | Chinese Gigaword Third Edition | ChineseGigaword
LDC2007T36 | Chinese Treebank 6.0 (CTB6.0) | Chinese-Treebank
LDC2007S08 | CSLU: Foreign Accented English Release 1.2 |
LDC2007S18 | CSLU: Kids' Speech Version 1.1 |
LDC2007S13 | CSLU: Apple Words and Phrases |
LDC2007S05 | CSLU: Yes/No Version 1.2 |
LDC2007T02 | English Chinese Translation Treebank v 1.0 |
LDC2007T07 | English Gigaword Third Edition | EnglishGigaword
LDC2007S02 | Fisher Levantine Arabic Conversational Telephone Speech |
LDC2007T04 | Fisher Levantine Arabic Conversational Telephone Speech, Transcripts |
LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | GALE
LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | GALE
LDC2007T20 | GALE Phase 1 Distillation Training | GALE
LDC2007T08 | ISI Arabic-English Automatically Extracted Parallel Text |
LDC2007T09 | ISI Chinese-English Automatically Extracted Parallel Text |
LDC2007S01 | Levantine Arabic Conversational Telephone Speech |
LDC2007T01 | Levantine Arabic Conversational Telephone Speech, Transcripts |
LDC2007S09 | Mandarin Affective Speech |
LDC2007T19 | MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) | MandarinBroadcastNews
LDC2007S15 | Nationwide Speech Project |
LDC2007T21 | OntoNotes v 1.0 | OntoNotes
LDC2007T03 | Tagged Chinese Gigaword |
LDC2007V02 | TRECVID 2003 Keyframes & Transcripts |
LDC2007V01 | TRECVID 2005 Keyframes & Transcripts |
LDC2006S44 | 2004 NIST Speaker Recognition Evaluation |
LDC2006T06 | ACE 2005 Multilingual Training Corpus |
LDC2006S46 | Arabic Broadcast News Speech | ArabicBroadcastNews
LDC2006T20 | Arabic Broadcast News Transcripts | ArabicBroadcastNews
LDC2006T02 | Arabic Gigaword Second Edition |
LDC2006S15 | CSLU: Spelled and Spoken Words |
LDC2006S14 | CSLU: Stories v 1.2 |
LDC2006S35 | CSLU: Multilanguage Telephone Speech Version 1.2 |
LDC2006S39 | CSLU: Names Release 1.3 |
LDC2006S26 | CSLU: Speaker Recognition Version 1.1 |
LDC2006S16 | CSLU: Spoltech Brazilian Portuguese Version 1.0 |
LDC2006S01 | CSLU: Voices |
LDC2006T10 | English-Arabic Treebank v 1.0 |
LDC2006T17 | French Gigaword First Edition | FrenchGigaword
LDC2006S43 | Gulf Arabic Conversational Telephone Speech |
LDC2006T15 | Gulf Arabic Conversational Telephone Speech, Transcripts |
LDC2006S45 | Iraqi Arabic Conversational Telephone Speech |
LDC2006T16 | Iraqi Arabic Conversational Telephone Speech, Transcripts |
LDC2006S42 | Korean Broadcast News Speech |
LDC2006T14 | Korean Broadcast News Transcripts |
LDC2006T03 | Korean Propbank |
LDC2006T09 | Korean Treebank Annotations Version 2.0 |
LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech |
LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts |
LDC2006S33 | Middle East Technical University Turkish Microphone Speech v 1.0 |
LDC2006T04 | Multiple-Translation Chinese (MTC) Part 4 |
LDC2006S13 | N4 NATO Native and Non-Native Speech |
LDC2006S31 | NIST 2003 Language Recognition Evaluation |
LDC2006T01 | Prague Dependency Treebank 2.0 |
LDC2006S34 | Russian through Switched Telephone Network (RuSTeN) |
LDC2006T12 | Spanish Gigaword First Edition |
LDC2006S30 | Speech Controlled Computing |
LDC2006T18 | TDT5 Multilingual Text |
LDC2006T19 | TDT5 Topics and Annotations |
LDC2006T08 | TimeBank 1.2 | TimeBank
LDC2006T13 | Web 1T 5-gram Version 1 (AFS has 1,2,3-grams only) | Web1T5gram
LDC2006S37 | West Point Heroico Spanish Speech |
LDC2006S36 | West Point Korean Speech |
LDC2005T09 | ACE 2004 Multilingual Training Corpus | ACE2004-Training
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v1.0 | TERN
LDC2005T35 | American National Corpus (ANC) Second Release |
LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech |
LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts |
LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | Arabic-Treebank
LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | Arabic-Treebank
LDC2005T30 | Arabic Treebank: Part 4 v 1.0 (MPG Annotation) | Arabic-Treebank
LDC2005S22 | Articulation Index |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | BBN-PCET
LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts |
LDC2005T13 | CCGbank |
LDC2005T34 | Chinese <-> English Name Entity Lists v 1.0 |
LDC2005T10 | Chinese English News Magazine Parallel Text | ChineseEnglishNewsText
LDC2005T14 | Chinese Gigaword Second Edition | ChineseGigaword
LDC2005T06 | Chinese News Translation Text Part 1 |
LDC2005T23 | Chinese Proposition Bank 1.0 | Chinese-PropBank-1.0
LDC2005T01 | Chinese Treebank 5.0 | Chinese-Treebank
LDC2005T01U01 | Chinese Treebank 5.1 | Chinese-Treebank
LDC2005S26 | CSLU: 22 Languages Corpus |
LDC2005T08 | Discourse Graphbank |
LDC2005T12 | English Gigaword Second Edition | EnglishGigaword
LDC2005S13 | Fisher English Training Part 2, Speech |
LDC2005T19 | Fisher English Training Part 2, Transcripts | Fisher
LDC2005T28 | HARD 2004 Text |
LDC2005T29 | HARD 2004 Topics and Annotations |
LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 |
LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | HKUST-Mandarin
LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) |
LDC2005L01 | Mawukakan Lexicon |
LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 |
LDC2005S16 | RT-04 MDE Training Data Speech |
LDC2005T24 | RT-04 MDE Training Data Text/Annotations |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | SantaBarbara/4
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus |
LDC2005T16 | TDT4 Multilingual Text and Annotations |
LDC2005S30 | The West Point Company G3 American English Speech Data Corpus |
LDC2005S28 | West Point Croatian Speech Corpus |
LDC2004T15 | 2000 Communicator Dialogue Act Tagged |
LDC2004T16 | 2001 Communicator Dialogue Act Tagged |
LDC2004S04 | 2002 NIST Speaker Recognition Evaluation |
LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech |
LDC2004T18 | Arabic English Parallel News Part 1 |
LDC2004T17 | Arabic News Translation Text Part 1 |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 |
LDC2004T11 | Arabic Treebank: Part 3 v 1.0 |
LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | BuckwalterArabicMA
LDC2004T05 | Chinese Treebank 4.0 |
LDC2004S01 | Czech Broadcast News Speech |
LDC2004T01 | Czech Broadcast News Transcripts |
LDC2004S13 | Fisher English Training Speech Part 1 Speech |
LDC2004T19 | Fisher English Training Speech Part 1 Transcripts |
LDC2004V01 | FORM1 Kinematic Gesture |
LDC2004T08 | Hong Kong Parallel Text |
LDC2004S02 | ICSI Meeting Speech |
LDC2004T04 | ICSI Meeting Transcripts | ICSI-Transcripts
LDC2004S05 | ISL Meeting Speech Part 1 |
LDC2004T10 | ISL Meeting Transcripts Part 1 |
LDC2004L01 | Klex: Finite-State Lexical Transducer for Korean |
LDC2004T03 | Morphologically Annotated Korean Text |
LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 |
LDC2004S09 | NIST Meeting Pilot Corpus Speech |
LDC2004T13 | NIST Meeting Pilot Corpus Transcripts and Metadata |
LDC2004T23 | Prague Arabic Dependency Treebank 1.0 |
LDC2004T25 | Prague Czech-English Dependency Treebank 1.0 |
LDC2004T14 | Proposition Bank I |
LDC2004S08 | RT-03 MDE Training Data Speech |
LDC2004T12 | RT-03 MDE Training Data Text and Annotations |
LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | SantaBarbara/3
LDC2004S07 | Switchboard Cellular Part 2 Audio |
LDC2004S12 | TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data |
LDC2003T03 | 1997 HUB5 German Transcripts |
LDC2003T04 | 1997 HUB5 Spanish Transcripts |
LDC2003T02 | 1998 HUB5 English Transcripts |
LDC2003S01 | 2001 Communicator Evaluation |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts |
LDC2003T11 | ACE-2 Version 1.0 |
LDC2003T20 | American National Corpus (ANC) First Release |
LDC2003T12 | Arabic Gigaword |
LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation |
LDC2003T06 | Arabic Treebank: Part 1 v 2.0 |
LDC2003T09 | Chinese Gigaword | ChineseGigaword
LDC2003T05 | English Gigaword | EnglishGigaword
LDC2003V01 | FORM2 Kinematic Gesture |
LDC2003L01 | Grassfields Bantu Fieldwork: Dschang Lexicon |
LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms |
LDC2003P01 | Korean Telephone Conversations Complete Set |
LDC2003L02 | Korean Telephone Conversations Lexicon |
LDC2003S03 | Korean Telephone Conversations Speech |
LDC2003T08 | Korean Telephone Conversations Transcripts |
LDC2003T13 | Message Understanding Conference (MUC) 6 |
LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | MTA
LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 |
LDC2003T10 | SAID | SAID
LDC2003S06 | Santa Barbara Corpus of Spoken American English Part II |
LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews |
LDC2003T16 | SummBank 1.0 |
LDC2003S05 | West Point Russian Speech |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts |
LDC2002S22 | 1997 HUB5 Arabic Evaluation |
LDC2002T39 | 1997 HUB5 Arabic Transcripts |
LDC2002S24 | 1997 HUB5 German Evaluation |
LDC2002S25 | 1997 HUB5 Spanish Evaluation |
LDC2002S10 | 1998 HUB5 English Evaluation |
LDC2002S56 | 2000 Communicator Evaluation |
LDC2002S13 | 2001 HUB5 English Evaluation |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation |
LDC2002S34 | 2001 NIST Speaker Recognition Evaluation Corpus |
LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 |
LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement |
LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement |
LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 |
LDC2002S28 | Emotional Prosody Speech and Transcripts |
LDC2002T26 | Korean English Treebank Annotations |
LDC2002T01 | Multiple-Translation Chinese Corpus |
LDC2002T07 | RST Discourse Treebank | RST_discourse_treebank
LDC2002S06 | Switchboard-2 Phase III Audio |
LDC2002T31 | The AQUAINT Corpus of English News Text | AQUAINT
LDC2002S04 | Translanguage English Database (TED) Speech |
LDC2002T03 | Translanguage English Database (TED) Transcripts |
LDC2002S35 | Voicemail Corpus Part II |
LDC2002S02 | West Point Arabic Speech Corpus |
LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material |
LDC2001S97 | 2000 NIST Speaker Recognition Evaluation | NIST2000
LDC2001T55 | Arabic Newswire Part 1 |
LDC2001T61 | CALLHOME Spanish Dialogue Act Annotation |
LDC2001T62 | CETEMpublico |
LDC2001T11 | Chinese Treebank 2.0 | Chinese-Treebank
LDC2001S16 | Grassfields Bantu Fieldwork: Ngomba Tone Paradigms |
LDC2001T02 | Message Understanding Conference (MUC) 7 | MUC_7
LDC2001T10 | Prague Dependency Treebank 1.0 | PDT1.0
LDC2001S04 | Speech in Noisy Environments (SPINE2) Part 1 Audio |
LDC2001T05 | Speech in Noisy Environments (SPINE2) Part 1 Transcripts |
LDC2001S06 | Speech in Noisy Environments (SPINE2) Part 2 Audio |
LDC2001T07 | Speech in Noisy Environments (SPINE2) Part 2 Transcripts |
LDC2001S08 | Speech in Noisy Environments (SPINE2) Part 3 Audio |
LDC2001T09 | Speech in Noisy Environments (SPINE2) Part 3 Transcripts |
LDC2001S99 | Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio |
LDC2001S13 | Switchboard Cellular Part 1 Audio |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio |
LDC2001T14 | Switchboard Cellular Part 1 Transcription |
LDC2001T60 | Syllable-Final /s/ Lenition |
LDC2001S93 | TDT2 Mandarin Audio Corpus |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | TDT/TDT2-Multilingual
LDC2001S94 | TDT3 English Audio |
LDC2001S95 | TDT3 Mandarin Audio |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | TDT/TDT3-Multilingual
LDC2000S86 | 1998 HUB4 Broadcast News Evaluation English Test Material |
LDC2000S88 | 1999 HUB4 Broadcast News Evaluation English Test Material | 1999-HUB4-Test
LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | BLLIP-WSJ
LDC2000T50 | Hong Kong Hansards Parallel Text | Hansard-Hong-Kong
LDC2000T47 | Hong Kong Laws Parallel Text | Hong-Kong-Laws
LDC2000T46 | Hong Kong News Parallel Text | Hong-Kong-News
LDC2000T45 | Korean Newswire |
LDC2000S85 | Santa Barbara Corpus of Spoken American English Part I | SantaBarbara/1
LDC2000S96 | Speech in Noisy Environments (SPINE) Evaluation Audio |
LDC2000T54 | Speech in Noisy Environments (SPINE) Evaluation Transcripts |
LDC2000S87 | Speech in Noisy Environments (SPINE) Training Audio | SPINE
LDC2000T49 | Speech in Noisy Environments (SPINE) Training Transcripts | SPINE
LDC2000S92 | TDT2 Careful Transcription Audio |
LDC2000T44 | TDT2 Careful Transcription Text | TDT/TDT2-Careful
LDC2000T52 | TREC Mandarin |
LDC2000T51 | TREC Spanish |
LDC2000T53 | Voice of America (VOA) Broadcast News Czech Transcript Corpus |
LDC2000S89 | Voice of America (VOA) Czech Broadcast News Audio |
LDC99S80 | 1997 Speaker Recognition Benchmark |
LDC99S81 | 1999 Speaker Recognition Benchmark |
LDC99L23 | American English Spoken Lexicon |
LDC99L22 | Egyptian Colloquial Arabic Lexicon |
LDC99T34 | Japanese Business News Text Supplement |
LDC99T40 | Portuguese Newswire Text |
LDC99T41 | Spanish Newswire Text, Volume 2 |
LDC99S78 | SUSAS |
LDC99T33 | SUSAS Transcripts |
LDC99S79 | Switchboard-2 Phase II |
LDC99S83 | Tactical Speaker Identification Speech Corpus (TSID) |
LDC99S84 | TDT2 English Audio |
LDC99T42 | Treebank-3 | Treebank/3
LDC99S82 | USC Marketplace Broadcast News Speech |
LDC99T36 | USC Marketplace Broadcast News Transcripts |
LDC98T31 | 1996 CSR HUB4 Language Model | 1996-CSR-Hub-4-LM
LDC98S71 | 1997 English Broadcast News Speech (HUB4) |
LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | English-Broadcast-News
LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) |
LDC98S74 | 1997 Spanish Broadcast News Speech (HUB4-NE) |
LDC98T29 | 1997 Spanish Broadcast News Transcripts (HUB4-NE) | Spanish-Broadcast-News
LDC98S76 | 1998 Speaker Recognition Benchmark |
LDC98L21 | COMLEX English Syntax Lexicon |
LDC98S67 | HTIMIT |
LDC98S69 | HUB5 Mandarin Telephone Speech Corpus |
LDC98T26 | HUB5 Mandarin Transcripts |
LDC98S70 | HUB5 Spanish Telephone Speech Corpus |
LDC98T27 | HUB5 Spanish Transcripts | Hub5-Spanish-Transcripts
LDC98T32 | JURIS |
LDC98S68 | LLHDB |
LDC98T30 | North American News Text Supplement |
LDC98S75 | Switchboard-2 Phase I |
LDC98S72 | Taiwanese Putonghua Speech and Transcripts |
LDC98T25 | TDT Pilot Study Corpus | TDT/TDT-Pilot-Corpus
LDC98S77 | Voicemail Corpus Part I |
LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) |
LDC97S44 | 1996 English Broadcast News Speech (HUB4) |
LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | CALLHOME
LDC97S42 | CALLHOME American English Speech |
LDC97T14 | CALLHOME American English Transcripts | CALLHOME
LDC97S45 | CALLHOME Egyptian Arabic Speech |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | CALLHOME
LDC97L18 | CALLHOME German Lexicon | CALLHOME
LDC97S43 | CALLHOME German Speech |
LDC97T15 | CALLHOME German Transcripts | CALLHOME
LDC97T12 | DSO Corpus of Sense-Tagged English |
LDC97S62 | Switchboard-1 Release 2 |
LDC97S63 | The CMU Kids Corpus |
LDC96S61 | 1996 Speaker Recognition Benchmark |
LDC96S36 | Boston University Radio Speech Corpus | Boston-University-Radio
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect |
LDC96S47 | CALLFRIEND American English-Southern Dialect |
LDC96S48 | CALLFRIEND Canadian French |
LDC96S49 | CALLFRIEND Egyptian Arabic |
LDC96S50 | CALLFRIEND Farsi |
LDC96S51 | CALLFRIEND German |
LDC96S52 | CALLFRIEND Hindi |
LDC96S53 | CALLFRIEND Japanese |
LDC96S54 | CALLFRIEND Korean |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect |
LDC96S57 | CALLFRIEND Spanish-Caribbean Dialect |
LDC96S58 | CALLFRIEND Spanish-Non-Caribbean Dialect |
LDC96S59 | CALLFRIEND Tamil |
LDC96S60 | CALLFRIEND Vietnamese |
LDC96L17 | CALLHOME Japanese Lexicon | CALLHOME
LDC96S37 | CALLHOME Japanese Speech |
LDC96T18 | CALLHOME Japanese Transcripts | CALLHOME
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | CALLHOME
LDC96S34 | CALLHOME Mandarin Chinese Speech |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | CALLHOME
LDC96L16 | CALLHOME Spanish Lexicon | CALLHOME
LDC96S35 | CALLHOME Spanish Speech |
LDC96T17 | CALLHOME Spanish Transcripts | CALLHOME
LDC96L14 | CELEX2 | CELEX
LDC96T11 | COMLEX Syntax Text Corpus Version 2.0 |
LDC96S33 | CSR-IV HUB3 |
LDC96S31 | CSR-IV HUB4 |
LDC96S30 | CTIMIT |
LDC96S38 | DCIEM/HCRC |
LDC96S32 | FFMTIMIT |
LDC96S29 | Frontiers in Speech Processing 93 |
LDC96S40 | Frontiers in Speech Processing 94 |
LDC96S64-1 | JEIDA/JCSD-Channel 0 City Names |
LDC96S64 | JEIDA/JCSD-Channel 0 Complete |
LDC96S64-2 | JEIDA/JCSD-Channel 0 Control Words |
LDC96S64-4 | JEIDA/JCSD-Channel 0 Four Digit Sequences |
LDC96S64-3 | JEIDA/JCSD-Channel 0 Isolated Digits |
LDC96S64-5 | JEIDA/JCSD-Channel 0 Mono Syllables |
LDC96S65-1 | JEIDA/JCSD-Channel 1 City Names |
LDC96S65 | JEIDA/JCSD-Channel 1 Complete |
LDC96S65-2 | JEIDA/JCSD-Channel 1 Control Words |
LDC96S65-4 | JEIDA/JCSD-Channel 1 Four Digit Sequences |
LDC96S65-3 | JEIDA/JCSD-Channel 1 Isolated Digits |
LDC96S65-5 | JEIDA/JCSD-Channel 1 Mono Syllables |
LDC96T10 | Message Understanding Conference (MUC) 6 Additional News Text |
LDC96S41 | VAHA (POLYPHONE II) |
LDC95T20 | Hansard French/English | Hansard-French
LDC95T8 | Japanese Business News Text | Japanese-Business-News
LDC95T21 | North American News Text Corpus | North-American-News
LDC95S25 | TRAINS Spoken Dialog Corpus | TRAINS
LDC95T7 | Treebank-2 | Treebank/2
LDC94S14A | Air Traffic Control Complete | Air-Traffic-Control
LDC94T5 | ECI Multilingual Text | ECI-Multilingual
LDC94S15 | SPIDRE |
LDC94T4A | UN Parallel Text (Complete) |
LDC93T1 | ACL/DCI |
LDC93S4A | ATIS0 Complete |
LDC93S4B | ATIS0 Pilot |
LDC93S4B-2 | ATIS0 Read |
LDC93S4B-3 | ATIS0 SD Read |
LDC93S5 | ATIS2 |
LDC93S6A | CSR-I (WSJ0) Complete |
LDC93S6C | CSR-I (WSJ0) Other |
LDC93S6B | CSR-I (WSJ0) Sennheiser |
LDC93S12 | HCRC Map Task Corpus | HCRC-Maptask-Transcripts
LDC93S2 | NTIMIT |
LDC93S3A | Resource Management Complete Set 2.0 |
LDC93S3B | Resource Management RM1 2.0 |
LDC93S3C | Resource Management RM2 2.0 |
LDC93S11 | Road Rally |
LDC93S8 | Switchboard Credit Card |
LDC93S7-T | Switchboard-1 Transcripts |
LDC93S9 | TI 46-Word |
LDC93S10 | TIDIGITS | TIDIGITS
LDC93S1 | TIMIT Acoustic-Phonetic Continuous Speech Corpus | TIMIT
LDC93T3A | TIPSTER Complete |
LDC93T3B | TIPSTER Volume 1 | TREC/TREC-1
LDC93T3C | TIPSTER Volume 2 | TREC/TREC-2
LDC93T3D | TIPSTER Volume 3 | TREC/TREC-3

Non-LDC Corpora

If a corpus is stored on AFS, the table below shows its directory under /afs/ir/data/linguistic-data/. Corpora marked with an asterisk require you to agree to an additional usage license. See Get Access for details.

Name | Annotation | Language | AFS
Aleksova's corpus | | Bulgarian (spoken) |
American Heritage Talking Dictionary (3rd edition) | | English |
ATIS | Syntax, POS, some argument structure (use TIGERSearch) | English |
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese |
British National Corpus (BNC) World Edition | (use gsearch) | English | BNC-world
British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English |
Brown Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English | Brown
Census 1990 Names | | English | IE/census1990names
CHRISTINE | | English | CHRISTINE
CMU Pronouncing Dictionary | | English | CMU-Pronouncing-Dict
Cornell SMART Archive | | English | SMART-Archive
Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) |
Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) |
DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English |
EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu |
Enron Email Corpus | | English | Enron-Email-Corpus
Excite log | | English | IR
International Computer Archive of Modern and Medieval English | diachronic corpus | English | ICAME
International Corpus of English - British Component | (use tgrep2) | English | ICE-GB
International Corpus of English - Singapore Component | (use tgrep2) | English | ICE-Singapore
IViE | Prosody, phonetic, etc. | British dialects |
John Rylands Univ Corpus of late 18c prose | | Early Modern English | Rylands18cProse
Kristie Seymore's Information Extraction Data | | English | IE/Kristie-Seymore-IE
LUCY | | English | LUCY
Mooney Job Data | | English | IE/Mooney-Job-Data
MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German | MuchMore
MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene | MULTEXT
NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | NEGRA
Nihon Kokugo Daijiten | | Japanese | KokugoDaijiten
PPCME2* | diachronic corpus | | PPCME2
PropBank | predicate structure enriched treebank | English | Proposition-Bank-1
Remedia Story Comprehension* | | English | QA
Reuters Corpus | | English | Reuters-Corpus
RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) |
Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) | Switchboard
Switchboard LINK Project Corpus* | Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2) | English (spoken) | Treebank/LINK-swbd
SUSANNE Corpus, Release 5 | | English | SUSANNE
TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German |
TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English | TIGERCorpus
TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English |
Unified Medical Language System (UMLS) | | English | UMLS
Verbmobil Dialogs | | German, English, Japanese | Verbmobil-Dialogs
Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English | Treebank
Wolverhampton Coreference | coreference and anaphora | English | Wolverhampton-Coreference
WordNet | lexical information database | English | WordNet
YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English |
Yomiuri Shinbun | | Japanese | YomiuriShinbun