Information Extraction of Domain-specific Business Documents with Limited Data

Cited by: 2

Authors:

Minh-Tien Nguyen ^{[1,2]}

Le Thai Linh ^{[1]}

Dung Tien Le ^{[1]}

Nguyen Hong Son ^{[1]}

Do Hoang Thai Duong ^{[1]}

Bui Cong Minh ^{[1]}

Akira Shojiguchi ^{[1]}

Affiliations:

[1] CINNAMON LAB, 10th Floor, Geleximco Bldg, 36 Hoang Cau, Hanoi, Vietnam

[2] Hung Yen Univ Technol & Educ, Hung Yen, Vietnam

Source:

2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2021

Keywords:

Information extraction; Document analysis;

DOI:

10.1109/IJCNN52387.2021.9534328

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Information extraction is a cornerstone of office-data digitization, which requires converting unstructured data into structured data. In actual business applications, however, adapting common extraction systems to domain-specific documents is a major obstacle because training data is difficult to prepare. To overcome this issue, we introduce a model that employs pre-trained language models with a customized CNN layer for domain adaptation. The model is validated on three Japanese domain-specific data sets and two benchmark machine reading comprehension data sets (SQuADs). Experimental results confirm that our model achieves promising results that are applicable to actual business scenarios.


Pages: 8

