The OntoNotes Project: Building a Large Corpus of Semantically Annotated Text
Venue: SemEval Workshop at ACL 2007 in Prague
Date: June 23, 2007
Many natural language processing (NLP) applications could benefit from a richer model of text meaning than the bag-of-words and n-gram models that currently predominate. Despite theoretical interest since the 1960s, however, no large-scale model exists; in fact, it is not even clear what such a model should minimally include. Yet the introduction of large-scale public resources such as the Penn TreeBank and WordNet has generated a great deal of progress in the NLP community, and so it seems increasingly important to create some kind of meaning-oriented model and build a corresponding corpus that is large enough to support adequate machine learning.
Two years ago, a collaboration was formed between BBN Technologies, the University of Colorado, the University of Pennsylvania, and USC/ISI to construct the OntoNotes corpus. The goal is to add, by manual annotation, shallow semantic information (what we call ‘literal semantics’) to a corpus of 1 million words of English, Chinese, and Arabic text in various genres (newstext, broadcast news, weblogs, etc.) and domains. Under the current shallow semantics, we include word sense disambiguation for nouns and verbs, the proposition structure of verbs and certain nouns (with proposition arguments connected), coreference links across sentences, and a structured reorganization of the word senses into an ontology.
We have released to LDC the first year’s corpus, which comprises 300K words of Wall Street Journal (English) text, with annotations of coreferences, the most frequent 600+ verbs’ senses, and the most frequent 350+ nouns’ senses. The second deliverable, which will be released to LDC toward the end of 2007, will include Arabic, Chinese, and English texts in various amounts, with increased coverage of verb and noun senses.
Naturally, it is important that the corpus be internally consistent. We have developed a methodology of annotation that guarantees 90% inter-annotator agreement. But this is not the only challenge facing creators of large-scale corpora – there are unresolved issues at almost every step of the way. Annotation is not (yet) an exact science. To help ensure internally consistent annotations suitable for machine learning, annotation projects have to address the following seven issues:
1. How does one decide what corpus to annotate: when is a corpus balanced and representative?
2. How does one decide what specific kinds of information to annotate? How does one adequately capture the theory behind the phenomena and express it in simple annotation instructions?
3. What interfaces does one build? How does one ensure that the interfaces support speed but do not influence the annotation results?
4. When hiring annotators, what characteristics are important? How does one ensure that they are adequately (and not over- or under-) trained?
5. How does one establish a simple, fast, and trustworthy annotation procedure? When are reconciliation and adjudication appropriate?
6. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one redesign or re-do the annotations?
7. How should one formulate and store the results? How can one maintain coherence across the various layers of annotation? How does one ensure compatibility with other existing resources? How does one make results available and ensure maintenance when the project ends?
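Issue 6 asks which agreement measures are appropriate and where to place a cutoff. As a rough illustration only, the sketch below computes two common measures for a pair of annotators labeling the same items: raw observed agreement and Cohen's kappa, which corrects for chance agreement. The abstract does not say which measure the project uses, so treating the 90% figure as a cutoff on observed agreement is an assumption, and the function names and sense labels are hypothetical.

```python
from collections import Counter

# Assumption: the 90% figure from the text is read as a cutoff on raw
# observed agreement; the abstract does not specify the measure.
AGREEMENT_CUTOFF = 0.90


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def needs_review(labels_a, labels_b, cutoff=AGREEMENT_CUTOFF):
    """True if observed agreement falls below the cutoff."""
    return observed_agreement(labels_a, labels_b) < cutoff


# Hypothetical sense labels for ten occurrences of one word.
annotator_a = ["s1", "s1", "s2", "s2", "s1", "s3", "s1", "s2", "s1", "s1"]
annotator_b = ["s1", "s1", "s2", "s1", "s1", "s3", "s1", "s2", "s1", "s2"]

print(f"observed agreement: {observed_agreement(annotator_a, annotator_b):.2f}")
print(f"Cohen's kappa:      {cohens_kappa(annotator_a, annotator_b):.2f}")
print(f"below cutoff:       {needs_review(annotator_a, annotator_b)}")
```

On this toy data the annotators agree on 8 of 10 items (0.80), kappa is about 0.63, and the word would fall below the assumed cutoff. The gap between the two numbers shows why the choice of measure matters when setting a threshold.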
This talk argues for the necessity of (even shallow) semantics-based NLP, describes the contents and operation of the OntoNotes project, and in so doing introduces and explains the general issues facing annotation projects. Our hope is that other people will not only use the OntoNotes corpus in their own work, but also create their own annotations on the same material, so that more layers of shallow semantics can be incorporated into OntoNotes.
Related Link: OntoNotes website