DMDD: A Large-Scale Dataset for Dataset Mentions Detection

Cited by: 1

Authors:

Pan, Huitong [1]

Zhang, Qi [1]

Dragut, Eduard [1]

Caragea, Cornelia [2]

Latecki, Longin Jan [1]

Affiliations:

[1] Temple Univ, Philadelphia, PA 19122 USA

[2] Univ Illinois, Chicago, IL USA

Source:

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS | 2023 / Volume 11

Funding:

U.S. National Science Foundation

DOI:

10.1162/tacl_a_00592

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.

Pages: 1132-1146

Number of pages: 15
