Linguistic Annotation
LAL (Linguistic Annotation Language)
Natural language texts contain many ambiguities which are difficult for
natural language processing systems to resolve properly. Some of these
ambiguities are not resolved due to the immaturity of NLP technologies, but
some can be resolved only by writers. Consider the following example:
a small computer company
There are two interpretations: one is "a company which produces small computers," and the other is "a small company which produces computers." If the text carries a linguistic
annotation as shown below, then NLP programs can recognize that the latter interpretation is the intended one.
a small <seg>computer company</seg>
In this example, a segment between <seg> and </seg> is recognized as a unit (or a phrase).
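As a rough illustration of how a program might read such an annotation, the following Python sketch extracts the marked segments and prints a bracketed form of the phrase. It is not part of LAL or of any LAL tooling; the helper names `segments` and `bracketed` and the dummy `<root>` wrapper are assumptions made for this example, which relies only on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def _parse(annotated: str) -> ET.Element:
    # Wrap the fragment in a dummy root element (an assumption of this
    # sketch) so that the mixed text and tags form well-formed XML.
    return ET.fromstring(f"<root>{annotated}</root>")


def segments(annotated: str) -> list[str]:
    """Return the text of every <seg> element in the fragment."""
    return ["".join(seg.itertext()) for seg in _parse(annotated).iter("seg")]


def bracketed(annotated: str) -> str:
    """Render the fragment with each <seg> shown as [ ... ]."""
    def render(elem: ET.Element) -> str:
        parts = [elem.text or ""]
        for child in elem:
            inner = render(child)
            parts.append(f"[{inner}]" if child.tag == "seg" else inner)
            parts.append(child.tail or "")  # text following the child
        return "".join(parts)

    return render(_parse(annotated))


example = "a small <seg>computer company</seg>"
print(segments(example))   # ['computer company']
print(bracketed(example))  # a small [computer company]
```

Because "computer company" is grouped as one unit, "small" can only attach to the whole segment, which corresponds to the reading "a small company which produces computers."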
We have developed an XML-based tag set called LAL (Linguistic Annotation Language) to annotate linguistic information. For instance, LAL has the following tags:
s ... specifies the scope of a sentence.
w ... specifies a single word.
seg ... specifies a phrase.
The following shows an example of LAL annotation.
IBM announced <lal:seg>a new computer system for children</lal:seg> with voice function.
In this sentence, the phrase "with voice function" may modify either "children" or
"system" from the syntactic viewpoint. Because the seg tag groups "a new computer
system for children" into a single unit, it specifies that the phrase modifies
"system."
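The namespaced form of the annotation can be processed in a similar way. The sketch below splits the sentence into top-level units so that a program can see that "with voice function" lies outside the segment. The namespace URI `http://example.org/lal`, the enclosing `lal:s` wrapper, and the function name `parse_sentence` are hypothetical choices made for this example only; the actual LAL namespace is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, assumed for illustration only.
LAL_NS = "http://example.org/lal"


def parse_sentence(fragment: str) -> list[tuple[str, str]]:
    """Split an annotated sentence into (kind, text) units.

    Plain text is reported as "text"; an annotated element is reported
    under its local tag name, for example "seg".
    """
    # Wrap the fragment in a sentence element that declares the prefix.
    root = ET.fromstring(f'<lal:s xmlns:lal="{LAL_NS}">{fragment}</lal:s>')
    units = []
    if root.text and root.text.strip():
        units.append(("text", root.text.strip()))
    for child in root:
        kind = child.tag.split("}")[-1]  # drop the "{namespace}" part
        units.append((kind, "".join(child.itertext()).strip()))
        if child.tail and child.tail.strip():
            units.append(("text", child.tail.strip()))
    return units


sentence = (
    "IBM announced <lal:seg>a new computer system for children</lal:seg> "
    "with voice function."
)
for kind, text in parse_sentence(sentence):
    print(f"{kind}: {text}")
# text: IBM announced
# seg: a new computer system for children
# text: with voice function.
```

A downstream parser could then treat the seg unit as a single noun phrase and attach the trailing prepositional phrase to it as a whole rather than to "children."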
It is difficult and tedious for end users to insert these tags manually. Therefore,
we have developed a GUI-based LAL tagging editor to facilitate this annotation work.
If this annotation framework spreads worldwide and linguistic information is
annotated in many documents, we can expect that NLP programs such as machine translation and automatic summarization will become much more beneficial to people.