DMDD: A Large-Scale Dataset for Dataset Mentions Detection

Cited by: 1
Authors
Pan, Huitong [1]
Zhang, Qi [1]
Dragut, Eduard [1]
Caragea, Cornelia [2]
Latecki, Longin Jan [1]
Affiliations
[1] Temple Univ, Philadelphia, PA 19122 USA
[2] Univ Illinois, Chicago, IL USA
Funding
U.S. National Science Foundation
DOI
10.1162/tacl_a_00592
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
Pages: 1132-1146
Number of pages: 15