Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Cited by: 0
Authors
Gyawali, Bikash [1]
Anastasiou, Lucas [1]
Knoth, Petr [1]
Affiliation
[1] Open University, Knowledge Media Institute, Milton Keynes, Buckinghamshire, England
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Keywords
Deduplication; Scholarly Documents; Locality Sensitive Hashing; Word Embeddings; Digital Repositories;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
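The abstract does not say how the two signals are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation: MinHash signatures with LSH banding act as a cheap structural filter, and the cosine similarity of averaged word vectors acts as a meaning check. Every function name, the shingle size, the number of hash permutations, the band count, the cosine threshold, and the `vectors` lookup table are assumptions made for illustration.

```python
import hashlib
import math
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed value)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: for each salted hash function, keep the minimum hash."""
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode()  # a different salt simulates a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set))
    return signature


def lsh_candidate(sig_a, sig_b, bands=16):
    """LSH banding: two documents are candidates if any band matches exactly."""
    rows = len(sig_a) // bands
    return any(sig_a[i * rows:(i + 1) * rows] == sig_b[i * rows:(i + 1) * rows]
               for i in range(bands))


def mean_vector(text, vectors):
    """Average the vectors of known words; `vectors` is a hypothetical lookup table."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w in vectors]
    if not words:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_near_duplicate(text_a, text_b, vectors, cos_threshold=0.9):
    """Assumed combination: LSH filter first, then an embedding similarity check."""
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    if not lsh_candidate(sig_a, sig_b):
        return False
    vec_a, vec_b = mean_vector(text_a, vectors), mean_vector(text_b, vectors)
    if vec_a is None or vec_b is None:
        return False
    return cosine(vec_a, vec_b) >= cos_threshold
```

In a real setting the `vectors` table would come from a pretrained word embedding model, and the thresholds would be tuned on labelled data, as the abstract describes for the paper's own parameters.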
Pages: 901-910
Page count: 10