
Corpora Installed at OSU

This list of resources was compiled by Markus Dickinson and Detmar Meurers (OSU) in February 2002. Funding for this project was provided by an OSU College of Humanities Seed Grant.

To see our internal tools and documentation, click here. To see a page of corpus resources on the web, click here.


Note

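Our directory layout (found under /home/corpora/) is organized by the 2-letter language codes (ISO 639) listed at The XML Cover Pages. For example, German corpora are under /home/corpora/DE/ and English corpora under /home/corpora/EN/.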
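The naming convention can be sketched in a few lines of Python. This is only an illustration, not one of our installed tools: the helper name corpus_dir is hypothetical, and it assumes the pattern visible in the listings on this page, namely an upper-cased two-letter code directly under /home/corpora/. Directories that fall outside that pattern, such as EN_FR and VARIOUS, are not handled.

```python
from pathlib import PurePosixPath

# Root of the corpus tree, as stated on this page.
CORPUS_ROOT = PurePosixPath("/home/corpora")


def corpus_dir(lang_code: str, corpus: str = "") -> PurePosixPath:
    """Build the directory for a language code, optionally for one corpus.

    Hypothetical helper. Assumes the convention seen on this page: an
    upper-cased two-letter ISO 639 code directly under the root
    (DE, EN, KO, RU, ZH). It only builds the path; it does not check
    that the directory exists.
    """
    code = lang_code.strip().upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a two-letter ISO 639 code, got {lang_code!r}")
    path = CORPUS_ROOT / code
    return path / corpus if corpus else path


print(corpus_dir("de"))             # /home/corpora/DE
print(corpus_dir("ru", "uppsala"))  # /home/corpora/RU/uppsala
```

Note that most corpora keep their source files in an original/ subdirectory below the corpus directory, so the path returned here is usually one level above the data itself.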


GERMAN CORPORA

/home/corpora/DE/
  • Donau Kurier (ECI/MCI): /home/corpora/DE/dk/original/ or /home/corpora/VARIOUS/eci_mc1/original/data/eci2/ger04/

    Installed in CQP (dk)

  • Frankfurter Rundschau (ECI/MCI): /home/corpora/DE/fr/original/ or /home/corpora/VARIOUS/eci_mc1/original/data/eci1/ger03/

    Installed in CQP (fr)

  • Goethe: /home/corpora/DE/goethe/original/

  • Negra, version 2: /home/corpora/DE/negra2/original/

    Installed in TIGERSearch

  • Taz: /home/corpora/DE/taz/original/, or in DEREKO format: /home/corpora/DE/taz/DEREKO-0.01/

    Years 1986-1991 installed in CQP (dereko_19**)
    Year 1986 installed in TIGERSearch

  • VDI Nachrichten (ECI/MCI): /home/corpora/VARIOUS/eci_mc1/original/data/eci1/ger02/

    Installed in CQP (vdi)

In addition, some corpora are installed only in ims-cwb (CQP) format; each is listed with its ims-cwb id:
  • Computer Zeitung (cz)
  • Glaw (glaw)
  • Mannheimer Korpus I (mk1 [regular], mk1-parsed [parsed])
  • Tuebinger Newskorpus (tn)

ENGLISH CORPORA

/home/corpora/EN/
  • IViE (Intonational Variation in English): /home/corpora/EN/IViE/original/

  • SUSANNE: /home/corpora/EN/SUSANNE/

  • Birkbeck Spelling Errors: /home/corpora/EN/birkbeck_spelling_errors/original/

  • BNC (British National Corpus): /home/corpora/EN/bnc/bncxml/

    BNC-SAMP installed in CQP

  • Christine: /home/corpora/EN/christine/

  • Penn Treebank, version 3: /home/corpora/EN/penn_treebank_3/original/

    Brown, Wall Street Journal, ATIS, and Switchboard corpora all installed in TIGERSearch
    Wall Street Journal installed in CQP

  • Wall Street Journal, years 1987-1992: /home/corpora/EN/wsj/original/

In addition, we have other English resources:
  • CMUDICT (Carnegie Mellon Pronouncing Dictionary), version 0.6: /home/corpora/EN/cmulex/

  • COMLEX-SYNTAX, version 3.1: /home/corpora/EN/comlex_synt_3.1/original/

  • WordNet: /home/corpora/EN/wordnet/wordnet1.7/


ENGLISH-FRENCH CORPORA

  • Hansard: /home/corpora/EN_FR/hansard/original/hansard.36/Release-2001.1a/


KOREAN-ENGLISH CORPORA

  • Korean-English Treebank: /home/corpora/KO/korean_english_treebank/original/


RUSSIAN CORPORA

  • Uppsala Corpus: /home/corpora/RU/uppsala/original/


CHINESE CORPORA

  • Penn Chinese Treebank: /home/corpora/ZH/chinese_treebank_2/original/


Questions or comments? Contact Markus Dickinson.


Last modified: June 16, 2005