Improving named entity recognition in noisy user-generated text with local distance neighbor feature

Cited by: 15
Authors
Al-Nabki, Mhd Wesam [1,2]
Fidalgo, Eduardo [1,2]
Alegre, Enrique [1,2]
Fernandez-Robles, Laura [1,2,3]
Affiliations
[1] Univ Leon, Dept Elect Syst & Automat, Leon, Spain
[2] INCIBE Spanish Natl Cybersecur Inst, Leon, Spain
[3] Univ Leon, Dept Mech Informat & Aerosp Engn, Leon, Spain
Keywords
Named entity recognition; Gazetteer; Text mining; Darknet; Hidden services
DOI
10.1016/j.neucom.2019.11.072
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recognizing infrequent or emerging named entities in user-generated text is a challenging task, especially when the text is informal or contains slang. Some recent works propose using a gazetteer to solve this problem, but this solution is not general because the gazetteer is task-specific and its maintenance is costly. In this paper, we overcome this drawback by presenting Local Distance Neighbor (LDN), a novel feature that substitutes for the gazetteer and enables the model to obtain state-of-the-art results. LDN captures an initial guess for each input token based on the categories of its neighboring tokens within an embedding space. We evaluated the proposed network on the W-NUT-2017 dataset and obtained the state-of-the-art F1 score for the Group, Person, and Product categories. We employed our new feature together with the model proposed by Aguilar et al. to recognize named entities in the Tor Darknet related to suspicious activities associated with weapons and drug selling. After extending the W-NUT-2017 dataset with 851 manually annotated entries, we repeated our evaluation on this extended version of the dataset, achieving entity and surface F1 scores of 52.96% and 50.57%, respectively. Furthermore, we demonstrate that our proposal can be useful for Law Enforcement Agencies in mining the textual information in Tor hidden services, and that it is especially well suited to the Group, Person, and Product categories. © 2019 Elsevier B.V. All rights reserved.
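The abstract describes LDN only at a high level: an initial guess for each token derived from the categories of its neighbors in an embedding space. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes cosine similarity, a fixed neighbor count `k`, and a simple vote over neighbor labels; the function names (`cosine`, `ldn_guess`) and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ldn_guess(token_vec, labeled, k=3):
    """Hypothetical neighbor-vote feature.

    Ranks the labeled (vector, category) pairs by cosine similarity to
    token_vec, keeps the k closest, and returns the share of each
    category among those neighbors.
    """
    ranked = sorted(labeled, key=lambda item: cosine(token_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}


# Toy embeddings, invented for this sketch (not taken from the paper).
LABELED = [
    ((1.0, 0.1), "person"),
    ((0.9, 0.2), "person"),
    ((0.1, 1.0), "product"),
    ((0.2, 0.9), "product"),
    ((0.7, 0.7), "group"),
]

guess = ldn_guess((0.95, 0.15), LABELED, k=3)
best = max(guess, key=guess.get)
```

In this toy setup the query vector lies closest to the two `person` vectors and then to the `group` vector, so the vote favors `person`. How the paper actually defines the neighborhood, the distance, and the way the feature is fed to the network is not stated in the abstract.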
Pages: 1-11
Number of pages: 11