A document search and text mining system has to analyze unstructured content.
Analyzing unstructured data usually involves a variety of natural language technologies,
including tokenization, parsing, and named entity extraction.
Using each processing module requires detailed knowledge of its technology, and
many components that have the same function have been developed.
To make components easy to reuse and integrate,
IBM Research has developed UIMA (Unstructured Information Management Architecture),
an infrastructure for building UIM (Unstructured Information Management) applications,
and has released the UIMA SDK
through alphaWorks.
UIMA defines the data structure that stores the original information and
the extracted information as the CAS (Common Analysis System).
It also defines the interface of a processing module as the TAE (Text Analysis Engine).
If a developer implements a module using this data structure and interface
and makes it UIMA compliant, another developer can reuse it on UIMA and
integrate it with their own application.
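
The division of labor between a shared data structure and a common module
interface can be illustrated with a minimal sketch. It is not the UIMA SDK API:
the names Annotation, Cas, AnalysisEngine, WhitespaceTokenizer,
CapitalizedNameAnnotator, and run_pipeline are hypothetical, chosen only to
mirror the roles of the CAS and the TAE described above, and the "named entity"
rule is a deliberately naive placeholder.

```python
import re
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass(frozen=True)
class Annotation:
    """A typed span over the original text (hypothetical, not a UIMA SDK class)."""
    type: str
    begin: int
    end: int


@dataclass
class Cas:
    """Toy stand-in for a CAS: the original text plus everything extracted from it."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        # Extracted information points back into the original text by offset.
        return self.text[annotation.begin:annotation.end]

    def select(self, type_: str) -> List[Annotation]:
        # Returns a new list, so callers may add annotations while iterating.
        return [a for a in self.annotations if a.type == type_]


class AnalysisEngine(Protocol):
    """Toy stand-in for the TAE interface: read the CAS and add annotations to it."""

    def process(self, cas: Cas) -> None: ...


class WhitespaceTokenizer:
    """Adds one 'Token' annotation per run of non-whitespace characters."""

    def process(self, cas: Cas) -> None:
        for match in re.finditer(r"\S+", cas.text):
            cas.annotations.append(Annotation("Token", match.start(), match.end()))


class CapitalizedNameAnnotator:
    """Naive placeholder for entity extraction; reuses tokens from an earlier engine."""

    def process(self, cas: Cas) -> None:
        for token in cas.select("Token"):
            word = cas.covered_text(token)
            if len(word) > 1 and word.isalpha() and word[0].isupper():
                cas.annotations.append(Annotation("Name", token.begin, token.end))


def run_pipeline(text: str, engines: List[AnalysisEngine]) -> Cas:
    """Runs the engines in order over a single shared Cas and returns it."""
    cas = Cas(text)
    for engine in engines:
        engine.process(cas)
    return cas
```

In this sketch the name annotator depends only on the shared Cas and on the
presence of "Token" annotations, not on the tokenizer class itself, so a
different tokenizer written by another developer could be substituted without
changing it.
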
At TRL, we are building a text mining system on UIMA
and developing a base system that can process documents efficiently.