Corpora and Corpus Annotation Tools on the WWW
This list of resources was collected by Markus Dickinson and Detmar Meurers (OSU), February 2002. Funding for this project was provided by an OSU College of Humanities Seed Grant.
Internal Documentation and Installed Corpora
You can find reference documentation for tools installed at OSU here.
You can find a list of our installed corpora here.
TOKENIZATION / SEGMENTATION TOOLS
LT TTT
(Text Tokenisation Tool), a text tokenization system from the Language
Technology Group
Segmenter
segments texts into topical chunks
SATZ, an adaptive
sentence boundary detector
MXTERMINATOR
by Adwait Ratnaparkhi
CPAN has Text::Sentence
(Ave), a module for splitting text into sentences.
Scott Piao's multilingual
concordancer has a sentence splitter (I think).
The Illinois Cognitive Computation Group has a sentence
splitter
Zhiping Zheng's QA system contains an online sentence
segmenter
Lingua-EN-Sentence-0.25 (Shlomo)
splits sentences based on regular expressions and lists of abbreviations.
Guenther(?),
a sentence segmenter which is to appear, I think (site in German)
Jorg Schuster has a Test
Sentencizer site which allows comparison of mxterminator, ave, and
shlomo.
Oliver Mason has a tokenizer called QTOKEN
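As a rough illustration of the approach noted above for Lingua-EN-Sentence (regular expressions plus a list of abbreviations), here is a minimal Python sketch. It is not the code of any tool listed here: the function name split_sentences, the short abbreviation list, and the boundary pattern are all assumptions made for this example, and real splitters use far larger lists and more rules.

```python
import re

# Hypothetical abbreviation list (lowercase, without the final period).
# Real tools ship much longer, language-specific lists.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and an
# uppercase letter (the uppercase letter is only looked at, not consumed).
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the punctuation mark
        # Last whitespace-delimited token before the punctuation mark.
        words = text[start:end - 1].split()
        last = words[-1].lower() if words else ""
        # A period after a listed abbreviation does not end the sentence.
        if match.group(1) == "." and last in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He sat down."):
        print(sentence)
```

A sketch like this fails on cases such as an abbreviation that really does end a sentence, which is one reason trained detectors such as SATZ and MXTERMINATOR exist.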
TAGGERS
A demo
from Xerox Research Centre Europe (XRCE)
WinBrill
from Analyse et Traitement Informatique de la Langue Francaise (ATILF)
ACOPOST, a collection of
POS taggers, including a maximum entropy tagger, a
trigram tagger, an error-driven TBL tagger, and an
example-based tagger.
Decision
Tree Tagger, developed by Helmut Schmid
Online interface for TreeTagger found here.
CLAWS
POS Tagger (costs). A trial version is available here.
AUTomatic
Analysis SYStem (AUTASYS), using the LOB & ICE tagsets
XEROX tagger,
available via FTP
TNT Tagger by
Thorsten Brants. TnT = "Trigrams 'n Tags"
LT POS,
a part-of-speech tagger from the Language Technology Group
Brill Tagger, a
transformation-based POS tagger. Site also includes supervised &
unsupervised POS taggers & a PP-attachment program. The FTP location
is found here
Various demos, including one for the Brill Tagger, can be found at the
Centre for Language
Engineering Demonstrations
An online tagger for German can be found at the University of Zurich
Maximum
Entropy POS Tagger (MXPOST) developed by Adwait Ratnaparkhi. Site
also has MXTERMINATOR, a sentence boundary detector
QTAG, a
probabilistic tagger roughly based on HMMs.
MuTBL, a
transformation-based learning system which can train Brill taggers
fnTBL is a machine
learning toolkit for NLP tasks.
MTP (Münster
Tagging Project), featuring Xlex, a suite of tools including a
tokenizer, segmenter, tagger, index tool, & collocation tool. An online
demo of Xlex can be found here.
AMALGAM, Automatic Mapping Among Lexico-Grammatical Annotation, maps tagsets
and phrase structure grammar schemes (includes a bibliography
on lexico-grammatical annotation models).
In addition to a shallow parser and a sentence splitter, the Cognitive
Computation Group at Illinois has a SNoW-based Tagger. SNoW
papers available here
VISL has a free upload interface
for automatic tagging/parsing of several languages at its website.
MORPHOLOGICAL ANALYZERS
Hermit Crab,
self-described as a "morphological parser and generator for classical
generative phonology and morphology"
POSTTAG for
use with Korean texts; a tagger & morphological analyzer. POSTPAR is
the syntactic analyzer
Morphy, a
morphological tool for German with some statistical POS tagging (site
is in German)
Morphix,
Günter Neumann's morphological component for inflectional languages
GERTWOL, a system
for automatic recognition of German word forms, using two-level
morphology
Word Manager is "a
system for the acquisition and management of reusable morphological
and phrasal dictionaries"
DeKo
(Derivations- und Kompositionsmorphologie) analyzes complex words of
the German language
John Carroll has some tools for morphological analysis (morpha),
generation (morphg), and a/an insertion (ana).
PC-KIMMO is a two-level
processor for morphological analysis, available from sil.org. Also
available from sil is AMPLE,
which breaks words into morphemes.
ALE-RA, an ALE
extension with Realizational morphology and Automata Phonology
Project Deutscher
Wortschatz at the University of Leipzig (site in German)
Deutsche
Malaga-Morphologie (DMM) is a system for the automatic wordform
recognition of German.
CISLEX
from the University of Munich (site in German)
For Russian: RUSLO
a system for Russian derivational analysis and synthesis (not downloadable)
For Turkish: Turkish
Morphological Analyzer is an online analyzer which treats both
word formation and inflection; developed by Kemal Oflazer
Krzysztof Szafran's freeware Windows and Linux versions of a morphological
analyser for Polish
ChaSen is a morphological
analyzer for Japanese
PARSERS/CHUNKERS
TEXT ANALYSIS
VARIOUS TOOLS (ANNOTATE, SEARCH, TRANSCRIBE)
Corpuseye offers
different searching techniques on different types of corpora and
different languages.
NEGRA
an annotate tool
Test Suites for Natural
Language Processing (TSNLP), an annotation scheme for use on test
suites in German, French, & English
VERBMOBIL,
some general annotation tools
TIGER
Search, a specialized search engine for syntactically annotated
corpora
The trees for TIGERSearch use SVG
(Scalable Vector Graphics), rendered with Batik
Transcriber, a
tool for segmenting, labeling and transcribing speech from the
Linguistic Data Consortium (LDC)
INTEX has multiple
uses, including parsing & tagging
Xlex has a
variety of tools
Alembic
Workbench includes customizable tagsets & evaluation tools to
analyze annotated data
The Callisto annotation tool
supports "linguistic annotation of textual sources for any
Unicode-supported language."
WordFreak
is an annotation tool for manual and automatic annotation, as well as
human correction.
ACE
(Automatic Content Extraction) annotation tools support multiple
annotation layers.
MMAX Annotation Tool
(Multi-Modal Annotation in XML) supports stand-off annotation, among
other things.
NXT (NITE XML) supports
linguistic annotation for highly structured or cross-annotated data.
PALinkA
(Perspicuous and Adjustable Links Annotator) has been used to annotate
texts for anaphora resolution, centering, summarization, and so on.
Corpus
Workbench (CWB) supports corpus search and data extraction for
data-driven approaches. It uses the Corpus Query Processor (CQP).
SMES,
Günter Neumann's information extraction system (with chunker &
morphological analyzer)
Connexor has various
annotation tools and some online demos of annotating sentences in
various languages
As part of the BulTreeBank, the CLaRK system is an XML-based
software system for corpora development.
AGTK Annotation Graph ToolKit
TGrep, for searching through the Penn Treebank, is downloadable here. Information on using
tgrep is available here.
GSearch, a
search tool which uses syntactic criteria, even if the corpus is not
syntactically marked up.
LingPipe does named
entity recognition, as well as other processing
GATE (General Architecture for Text
Engineering) offers a lot of text processing tools
The TALP research center has
various analyzers for Spanish and has recently
released FreeLing, an open-source C++ library
providing language analysis services
XML TOOLS
CORPORA
The Mannheim
corpus, including links to COSMAS (COrpus
Storage Maintenance and Access System), which provides links to
corpora at IDS. A listing of the Mannheim corpus can be found here. (All
sites in German)
International Corpus
of English (ICE), a World Englishes corpus with syntactic
annotation -- uses the tool ICECUP (costs)
UCREL
(Lancaster) has a decent list of corpora
Linguistic Data Consortium
(LDC) contains various corpora, e.g. Portuguese newspapers &
Chinese Audio Treebank
(Ohio State has LDC membership for the years 1995, 1999, 2000, and 2001.)
ELRA, a
listing of different corpora
Project Gutenberg for English
texts. You can buy it here on CD.
And Project
Gutenberg-DE is the German version
ECI (European
Corpus Initiative) Multilingual Corpus including Frankfurter
Rundschau and Donaukurier
The British National Corpus (BNC)
ICAME has corpora
available, as well as online journals
Doug Biber and Mark Davies are working on tagging a Spanish corpus.
See details here.
The EMILLE corpus, containing monolingual written corpus data for 14 South Asian languages
The Lancaster
Corpus of Mandarin Chinese, which is part-of-speech tagged and
available free of charge
SYNTACTICALLY-ANNOTATED CORPORA
NEGRA,
a syntactically annotated corpus of German newspaper texts
VERBMOBIL, a
corpus among other things; this is the overview page.
TIGER
Project, Linguistic Interpretation of a German Corpus, which will
be about 50,000 sentences & annotated using LFG
TUSNELDA,
the Tübingen collection of reusable, empirical, linguistic data
structures
Penn Treebank
Project, a bank of trees, with part of speech tags, among other
annotations
DEREKO (Mannheim
page -- acquisition) (Tuebingen
page -- annotation) (Stuttgart
page -- exploitation) provides annotated German corpora
PARC has a dependency bank of 700 sentences available here.
ONLINE CORPORA
META-SEARCHES AND OTHER ONLINE RESOURCES
Michael Barlow has a very nice page here, devoted to many facets of corpus linguistics
David Lee has a very extensive site
devoted to corpora and corpus resources.
SFB441
has a listing of software for corpus linguistic research
Annotation: a
site by Steven Bird which lists all sorts of tools for linguistic
annotation. Many of them are speech-based.
Penn
Tools is a listing of corpora and tools available at UPenn
Resources
for Text, Speech and Language Processing
TIGER
lists several useful links for Treebank projects
Frequency lists of words found in the BNC can be found here
ICAME has a bibliography
online, as well as in searchable form.
EAGLES
(Expert Advisory Group on Language Engineering Standards) provides
recommendations on corpus typology.
W3C Corpus Linguistics Page at the University of Essex
Note
Our system (found under /home/corpora) is organized by the 2-letter
language codes (ISO 639) found at The XML Cover
Pages.
Questions or comments? Contact Markus Dickinson.