Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliation
[1] Open Univ, Knowledge Media Inst, Milton Keynes, Bucks, England
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
Pages: 901 - 910
Number of pages: 10
Related papers
50 in total
  • [41] Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets
    Gonzalez-Lima, Maria D.
    Ludena, Carenne C.
    MATHEMATICS, 2022, 10 (11)
  • [42] Image super resolution using distributed locality sensitive hashing for manifold learning
    Tripathi, Anurag
    Gupta, Abhinav
    Chaudhury, Santanu
    Singh, Arun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (18) : 25673 - 25684
  • [43] Query by Humming by Using Locality Sensitive Hashing based on Combination of Pitch and Note
    Wang, Qiang
    Guo, Zhiyuan
    Liu, Gang
    Guo, Jun
    Lu, Yueming
    2012 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS (ICMEW), 2012, : 302 - 307
  • [44] Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing
    Hisashi Koga
    Tetsuo Ishibashi
    Toshinori Watanabe
    Knowledge and Information Systems, 2007, 12 : 25 - 53
  • [45] Estimating user response rate using locality sensitive hashing in search marketing
    Maryam Almasharawi
    Ahmet Bulut
    Electronic Commerce Research, 2022, 22 : 37 - 51
  • [46] Image super resolution using distributed locality sensitive hashing for manifold learning
    Anurag Tripathi
    Abhinav Gupta
    Santanu Chaudhury
    Arun Singh
    Multimedia Tools and Applications, 2019, 78 : 25673 - 25684
  • [47] Locality Sensitive Hashing for Efficient Similar Polygon Retrieval
    Kaplan, Haim
    Tenenbaum, Jay
    38TH INTERNATIONAL SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE (STACS 2021), 2021, 187
  • [48] Real-time recommendation with locality sensitive hashing
    Aytekin, Ahmet Maruf
    Aytekin, Tevfik
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2019, 53 (01) : 1 - 26
  • [49] Real-time recommendation with locality sensitive hashing
    Ahmet Maruf Aytekin
    Tevfik Aytekin
    Journal of Intelligent Information Systems, 2019, 53 : 1 - 26
  • [50] Estimating user response rate using locality sensitive hashing in search marketing
    Almasharawi, Maryam
    Bulut, Ahmet
    ELECTRONIC COMMERCE RESEARCH, 2022, 22 (01) : 37 - 51