Authors: Kazuo Hara (1); Ikumi Suzuki (1); Kousaku Okubo (1) and Isamu Muto (2)
Affiliations: (1) National Institute of Genetics, Japan; (2) BITS Co., Ltd., Japan
Keyword(s): Semi-Automated Information Extraction, Cohesive Text, Itemized Text.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining Text and Semi-Structured Data; Pre-Processing and Post-Processing for Data Mining; Symbolic Systems
Abstract:
Anatomical knowledge written in textbooks is almost impossible to reuse computationally because it is embedded in cohesive discourse. In such discourse, the frequent use of cohesive ties, such as reference expressions and coordinated phrases, produces complicated sentences that not only hinder automated systems (i.e., natural language parsers) from extracting knowledge but also impair the identification of mentions of anatomical named entities (NEs). We propose to revamp the prose style of anatomical textbooks by transforming cohesive discourse into itemized text, which can be accomplished by annotating reference expressions and coordinating conjunctions. Each anaphor is then automatically replaced by its antecedent, and each sentence containing a coordinating conjunction that connects phrases is duplicated so that the conjoined elements are distributed across the copies. We demonstrate that, compared to the original text, the transformed text is easier for machines to process and hence more convenient for identifying mentions of anatomical NEs and their relations. Since the transformed text remains human readable, we believe our approach provides a promising new model for language resources accessible to both humans and machines, improving the computational reusability of textbooks.
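
The two automatic steps described in the abstract (replacing each anaphor with its antecedent, then duplicating a sentence once per conjoined element) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Reference` and `Coordination` annotation records, the function names, and the example sentence are all hypothetical, and the annotations are assumed to be supplied by a human annotator as plain strings.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Reference:
    """Hypothetical annotation: an anaphor and the antecedent it refers to."""
    anaphor: str
    antecedent: str


@dataclass(frozen=True)
class Coordination:
    """Hypothetical annotation: a coordinated phrase and its conjoined elements."""
    phrase: str
    conjuncts: tuple


def resolve_references(sentence, references):
    """Replace the first whole-word occurrence of each anaphor with its antecedent."""
    for ref in references:
        pattern = rf"\b{re.escape(ref.anaphor)}\b"
        # A callable replacement avoids backslash interpretation in the antecedent.
        sentence = re.sub(pattern, lambda _m, r=ref: r.antecedent, sentence, count=1)
    return sentence


def distribute_coordinations(sentence, coordinations):
    """Duplicate the sentence so that each copy carries one conjunct per coordination."""
    present = [c for c in coordinations if c.phrase in sentence]
    results = []
    # product() over zero coordinations yields one empty choice,
    # so a sentence without coordination is returned unchanged.
    for choice in product(*(c.conjuncts for c in present)):
        itemized = sentence
        for coord, conjunct in zip(present, choice):
            itemized = itemized.replace(coord.phrase, conjunct, 1)
        results.append(itemized)
    return results


def itemize(sentence, references, coordinations):
    """Apply reference resolution first, then distribute coordinated phrases."""
    return distribute_coordinations(
        resolve_references(sentence, references), coordinations
    )


if __name__ == "__main__":
    # Invented example sentence and annotations, for illustration only.
    refs = [Reference(anaphor="It", antecedent="The femur")]
    coords = [
        Coordination(
            phrase="the tibia and the patella",
            conjuncts=("the tibia", "the patella"),
        )
    ]
    for line in itemize("It articulates with the tibia and the patella.", refs, coords):
        print(line)
```

Under these assumptions, each output sentence mentions exactly one entity pair, which is the property that makes itemized text simpler to parse. A realistic version would work on annotated character offsets rather than string matching, since the same anaphor or phrase can occur more than once in a sentence.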