Information Extraction of Domain-specific Business Documents with Limited Data

Cited by: 2
Authors
Minh-Tien Nguyen [1, 2]
Le Thai Linh [1 ]
Dung Tien Le [1 ]
Nguyen Hong Son [1 ]
Do Hoang Thai Duong [1 ]
Bui Cong Minh [1 ]
Akira Shojiguchi [1 ]
Affiliations
[1] CINNAMON LAB, 10th Floor, Geleximco Bldg, 36 Hoang Cau, Hanoi, Vietnam
[2] Hung Yen Univ Technol & Educ, Hung Yen, Vietnam
Source
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2021
Keywords
Information extraction; Document analysis
DOI
10.1109/IJCNN52387.2021.9534328
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Information extraction is a cornerstone of office-data digitization, which requires converting unstructured data into structured data. In real business applications, however, adapting common extraction systems to domain-specific documents is a major obstacle because training data are difficult to prepare. To overcome this issue, we introduce a model that combines pre-trained language models with a customized CNN layer for domain adaptation. The model is validated on three Japanese domain-specific data sets and two benchmark machine reading comprehension data sets (SQuADs). Experimental results confirm that the model achieves promising results that are applicable to actual business scenarios.
Pages: 8