Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliations
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection constitutes a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying the deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
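The hybrid idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes MinHash with banding as the locality sensitive hashing scheme, averaged word vectors as the document embedding, and a two-stage combination in which LSH proposes candidate pairs and embedding cosine similarity confirms them. All function names, the signature length, the band count, and the 0.9 threshold are hypothetical choices made for illustration.

```python
import hashlib
import math
from itertools import combinations

NUM_HASHES = 32  # assumed MinHash signature length
BANDS = 8        # assumed number of LSH bands (4 rows per band)

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set):
    """MinHash signature: for each seeded hash, keep the minimum hash value."""
    signature = []
    for seed in range(NUM_HASHES):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def candidate_pairs(signatures):
    """Bucket documents by band; documents sharing any bucket become candidates."""
    rows = NUM_HASHES // BANDS
    buckets = {}
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def embed(text, vectors):
    """Average the vectors of known words (a simple, assumed document embedding)."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def find_duplicates(docs, vectors, threshold=0.9):
    """Stage 1: LSH candidates. Stage 2: keep pairs whose embeddings agree."""
    signatures = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    duplicates = []
    for a, b in sorted(candidate_pairs(signatures)):
        if cosine(embed(docs[a], vectors), embed(docs[b], vectors)) >= threshold:
            duplicates.append((a, b))
    return duplicates
```

In this sketch the banding step keeps the number of pairwise comparisons small, and the embedding check acts as a second filter on the surviving candidates. In practice the word vectors would come from a pretrained embedding model rather than a hand-written dictionary, and the parameters would need tuning on labelled data.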
Pages: 901 - 910
Number of pages: 10
Related papers
50 in total
  • [21] LSHWE: Improving Similarity-Based Word Embedding with Locality Sensitive Hashing for Cyberbullying Detection
    Zhao, Zehua
    Gao, Min
    Luo, Fengji
    Zhang, Yi
    Xiong, Qingyu
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [22] EFFICIENT MANIFOLD LEARNING FOR SPEECH RECOGNITION USING LOCALITY SENSITIVE HASHING
    Tomar, Vikrant Singh
    Rose, Richard C.
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 6995 - 6999
  • [23] Fast Fuzzy Search for Mixed Data Using Locality Sensitive Hashing
    Lee, Kyung Mi
    Lee, Keon Myung
    PROGRESS IN MECHATRONICS AND INFORMATION TECHNOLOGY, PTS 1 AND 2, 2014, 462-463 : 321 - +
  • [24] A Locality Sensitive Hashing Technique for Categorical Data
    Lee, Kyung Mi
    Lee, Keon Myung
    INDUSTRIAL INSTRUMENTATION AND CONTROL SYSTEMS, PTS 1-4, 2013, 241-244 : 3159 - 3164
  • [25] Using Locality Sensitive Hashing to Improve the KNN Algorithm in the MapReduce Framework
    Bagui, Sikha
    Mondal, Arup Kumar
    Bagui, Subhash
    ACMSE '18: PROCEEDINGS OF THE ACMSE 2018 CONFERENCE, 2018,
  • [26] Ultrafast Genomic Database Search using Layered Locality Sensitive Hashing
    Chakraborty, Angana
    Bandyopadhyay, Sanghamitra
    PROCEEDINGS OF 2018 FIFTH INTERNATIONAL CONFERENCE ON EMERGING APPLICATIONS OF INFORMATION TECHNOLOGY (EAIT), 2018,
  • [27] Locality Sensitive Hashing of Customer Load Profiles
    Beretka, Sandor F.
    Varga, Ervin D.
    2013 INTERNATIONAL CONFERENCE ON RENEWABLE ENERGY RESEARCH AND APPLICATIONS (ICRERA), 2013, : 353 - 356
  • [28] An Improved Algorithm for Locality-Sensitive Hashing
    Cen, Wei
    Miao, Kehua
    10TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2015), 2015, : 61 - 64
  • [29] Optimal Parameters for Locality-Sensitive Hashing
    Slaney, Malcolm
    Lifshits, Yury
    He, Junfeng
    PROCEEDINGS OF THE IEEE, 2012, 100 (09) : 2604 - 2623
  • [30] Locality Sensitive Hashing with Extended Differential Privacy
    Fernandes, Natasha
    Kawamoto, Yusuke
    Murakami, Takao
    COMPUTER SECURITY - ESORICS 2021, PT II, 2021, 12973 : 563 - 583