Corpora Installed at OSU
This list of resources was compiled by Markus Dickinson and Detmar Meurers (OSU) in February 2002. Funding for this project was provided by an OSU College of Humanities Seed Grant.
To see our internal tools and documentation, click here. To see a page of corpus resources on the web, click here.
Note
Our directory layout (found under /home/corpora/) is organized by the
2-letter language codes (ISO 639) listed at The XML Cover
Pages.
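The naming scheme in the note can be sketched as a small path helper. This is only an illustration: the function name corpus_dir and its parameters are hypothetical and not part of the installed tools, and not every corpus listed here has an original/ subdirectory. Shared directories such as /home/corpora/VARIOUS/ are not language codes, so the helper does not cover them.

```python
import posixpath

# Root directory named in the note above.
CORPORA_ROOT = "/home/corpora"


def corpus_dir(lang_code, corpus=None, subdir=None):
    """Build a corpus path such as /home/corpora/DE/negra2/original/.

    lang_code: 2-letter ISO 639 code, e.g. "de" or "EN" (case-insensitive).
    corpus:    optional corpus directory name, e.g. "negra2".
    subdir:    optional subdirectory, e.g. "original" (not all corpora have one).
    """
    if len(lang_code) != 2 or not lang_code.isalpha():
        raise ValueError("expected a 2-letter language code, got %r" % lang_code)
    # Language directories on this page are written in upper case (DE, EN).
    parts = [CORPORA_ROOT, lang_code.upper()]
    if corpus:
        parts.append(corpus)
    if subdir:
        parts.append(subdir)
    return posixpath.join(*parts) + "/"


if __name__ == "__main__":
    print(corpus_dir("de"))
    print(corpus_dir("de", "negra2", "original"))
    print(corpus_dir("en", "SUSANNE"))
```

The helper only assembles path strings; it does not check that the directory actually exists on the system.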
GERMAN CORPORA
/home/corpora/DE/
Donau Kurier (ECI/MCI): /home/corpora/DE/dk/original/
or /home/corpora/VARIOUS/eci_mc1/original/data/eci2/ger04/
Installed in CQP (dk)
Frankfurter Rundschau (ECI/MCI):
/home/corpora/DE/fr/original/ or
/home/corpora/VARIOUS/eci_mc1/original/data/eci1/ger03/
Installed in CQP (fr)
Goethe: /home/corpora/DE/goethe/original/
Negra, version 2: /home/corpora/DE/negra2/original/
Installed in TIGERSearch
Taz: /home/corpora/DE/taz/original/, or in DEREKO
format: /home/corpora/DE/taz/DEREKO-0.01/
Years 1986-1991 installed in CQP (dereko_19**)
Year 1986 installed in TIGERSearch
VDI Nachrichten (ECI/MCI):
/home/corpora/VARIOUS/eci_mc1/original/data/eci1/ger02/
Installed in CQP (vdi)
In addition, some corpora are installed only in ims-cwb (CQP) format
(listed with their ims-cwb IDs):
- Computer Zeitung (cz)
- Glaw (glaw)
- Mannheimer Korpus I (mk1 [regular], mk1-parsed [parsed])
- Tuebinger Newskorpus (tn)
ENGLISH CORPORA
/home/corpora/EN/
IViE (Intonational Variation in English):
/home/corpora/EN/IViE/original/
SUSANNE: /home/corpora/EN/SUSANNE/
Birkbeck Spelling Errors:
/home/corpora/EN/birkbeck_spelling_errors/original/
BNC (British National Corpus):
/home/corpora/EN/bnc/bncxml/
BNC-SAMP installed in CQP
Christine: /home/corpora/EN/christine/
Penn Treebank, version 3:
/home/corpora/EN/penn_treebank_3/original/
Brown, Wall Street Journal, ATIS, and Switchboard corpora all
installed in TIGERSearch
Wall Street Journal installed in CQP
Wall Street Journal, years 1987-1992:
/home/corpora/EN/wsj/original/
In addition, we have other English resources:
CMUDICT (Carnegie Mellon Pronouncing Dictionary),
version 0.6:
/home/corpora/EN/cmulex/
COMLEX-SYNTAX, version 3.1:
/home/corpora/EN/comlex_synt_3.1/original/
WordNet: /home/corpora/EN/wordnet/wordnet1.7/
ENGLISH-FRENCH CORPORA
KOREAN-ENGLISH CORPORA
RUSSIAN CORPORA
CHINESE CORPORA
Questions or comments? Contact Markus Dickinson.