Automatic knowledge extraction from documents

Cited by: 49
Authors
Fan, J. [1]
Kalyanpur, A. [1]
Gondek, D. C. [1]
Ferrucci, D. A. [1]
Affiliations
[1] IBM Corp, Div Res, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
Funding
Russian Foundation for Basic Research;
Keywords
Natural language processing systems - Aggregates - Data mining - Syntactics;
DOI
10.1147/JRD.2012.2186519
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Access to a large amount of knowledge is critical for success at answering open-domain questions for DeepQA systems such as IBM Watson (TM). Formal representation of knowledge has the advantage of being easy to reason with, but acquisition of structured knowledge in open domains from unstructured data is often difficult and expensive. Our central hypothesis is that shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system. We take a two-stage approach to extract the syntactic knowledge and implied semantics. First, shallow knowledge from large collections of documents is automatically extracted. Second, additional semantics are inferred from aggregate statistics of the automatically extracted shallow knowledge. In this paper, we describe in detail what kind of shallow knowledge is extracted, how it is automatically done from a large corpus, and how additional semantics are inferred from aggregate statistics. We also briefly discuss the various ways extracted knowledge is used throughout the IBM DeepQA system.
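The two-stage approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the subject-verb-object frame shape, the toy data, and the function name `object_given_subject_verb` are all assumptions made for illustration. Stage 1 is represented by a hand-written list of shallow frames standing in for parser output over a large corpus; stage 2 derives a simple aggregate statistic (relative frequency) from those frames.

```python
from collections import Counter

# Hypothetical stage-1 output: shallow (subject, verb, object) frames.
# In a real system these would be extracted automatically from parsed text;
# here they are hand-written toy data.
frames = [
    ("scientist", "publish", "paper"),
    ("scientist", "publish", "paper"),
    ("scientist", "publish", "book"),
    ("author", "publish", "book"),
    ("author", "write", "book"),
]


def object_given_subject_verb(frames, subject, verb):
    """Stage 2 (illustrative): relative frequency of each object
    observed with the given subject and verb."""
    counts = Counter(o for s, v, o in frames if s == subject and v == verb)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {obj: n / total for obj, n in counts.items()}


probs = object_given_subject_verb(frames, "scientist", "publish")
```

With this toy data, "paper" is the more frequent object of "scientist publish", which is the kind of implied semantics that aggregate counts over many shallow frames could surface.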
Pages: 10