Cluster-based mention typing for named entity disambiguation

Cited by: 1
Authors
Celebi, Arda [1]
Ozgur, Arzucan [1]
Affiliations
[1] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
Keywords
Named entity disambiguation; Clustering; Mention typing; Information extraction; Linking
DOI
10.1017/S1351324920000443
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
An entity mention in text such as "Washington" may correspond to many different named entities, such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation (NED) is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g., location or person) of a mentioned entity can be correctly predicted from the context, it may increase the chance of selecting the right candidate by assigning low probability to the unlikely ones. This paper proposes cluster-based mention typing for NED. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster IDs are assigned as types. The hyperlinked mentions and their contexts in Wikipedia are used to obtain these cluster-based types. Mention typing models are then trained on these mentions, which have been labeled with their cluster-based types through distant supervision. At the NED phase, the cluster-based types of a given mention are first predicted, and these types are then used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain a different clustering (and thus a different typing model) for each level. As each clustering partitions the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves results that, according to randomization tests, are better than or comparable to the state of the art on four de facto standard test sets.
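The pipeline described in the abstract (cluster entities by context, treat cluster IDs as types, predict a mention's type, use the prediction when ranking candidates) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words vectors, the k-means-style clustering, the nearest-centroid "typing model", and the product of prior and type score used for ranking are all simplifying assumptions, and every entity context, prior, and function name below is hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector for a context string (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_entities(entity_contexts, seeds, iterations=5):
    """Toy k-means-style clustering; cluster IDs play the role of types."""
    centroids = [Counter(entity_contexts[s]) for s in seeds]
    assignment = {}
    for _ in range(iterations):
        # Assign each entity to its most similar centroid.
        assignment = {
            entity: max(range(len(centroids)),
                        key=lambda k: cosine(vec, centroids[k]))
            for entity, vec in entity_contexts.items()
        }
        # Recompute each centroid as the sum of its members' vectors.
        for k in range(len(centroids)):
            members = [entity_contexts[e]
                       for e, c in assignment.items() if c == k]
            if members:
                centroids[k] = sum(members, Counter())
    return assignment, centroids


def predict_type_scores(mention_context, centroids):
    """Stand-in for a trained typing model: similarity to each cluster."""
    vec = bow(mention_context)
    return [cosine(vec, c) for c in centroids]


def rank_candidates(candidate_priors, assignment, type_scores):
    """Stand-in for the ranking model: prior weighted by type score."""
    return max(candidate_priors,
               key=lambda e: candidate_priors[e] * type_scores[assignment[e]])


# Hypothetical contexts collected from hyperlinked mentions of each entity.
entity_contexts = {
    "Washington_D.C.": bow("capital city located on the river district city government"),
    "Seattle": bow("city located in the state port city population"),
    "Washington_Post": bow("daily newspaper published editor newspaper reporters"),
    "New_York_Times": bow("newspaper published daily editor journalists"),
}
assignment, centroids = cluster_entities(
    entity_contexts, seeds=["Washington_D.C.", "Washington_Post"])

# Disambiguate the mention "Washington" in a newspaper-like context.
scores = predict_type_scores(
    "the newspaper published a report by its editor", centroids)
priors = {"Washington_D.C.": 0.6, "Washington_Post": 0.4}  # made-up priors
best = rank_candidates(priors, assignment, scores)
print(best)
```

In this toy data the predicted cluster-based type favors the newspaper cluster strongly enough to outweigh the higher prior of the city, which is the effect the abstract attributes to mention typing.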
Pages: 1-37
Page count: 37