Transformers-based information extraction with limited data for domain-specific business documents

Cited by: 18
Authors
Nguyen, Minh-Tien [1 ,2 ]
Le, Dung Tien [1 ]
Le, Linh [3 ]
Affiliations
[1] CINNAMON LAB, 10th Floor,Geleximco Bldg,36 Hoang Cau, Hanoi, Vietnam
[2] Hung Yen Univ Technol & Educ, Hung Yen, Vietnam
[3] Univ Queensland, Brisbane, Qld, Australia
Keywords
Information extraction; Transfer learning; Transformers; NETWORKS
DOI
10.1016/j.engappai.2020.104100
CLC number
TP [Automation Technology; Computer Technology]
Discipline code
0812
Abstract
Information extraction plays an important role in data transformation for business cases. However, building extraction systems in practice faces two challenges: (i) labeled data is usually limited, and (ii) highly fine-grained classification is required. This paper introduces a model that addresses both challenges. Unlike prior studies, which usually require a large number of training samples, our extraction model is trained on a small amount of data to extract a large number of information types. To do so, the model exploits the contextual representations of pre-trained language models, which are trained on huge amounts of general-domain data. To adapt to our downstream task, the model employs transfer learning by stacking Convolutional Neural Networks on these representations to learn hidden features for classification. To confirm the efficiency of our method, we apply the model to two actual document-processing cases: bidding and sales documents from two Japanese companies. Experimental results on real test sets show that, with a small amount of training data, our model achieves high accuracy accepted by our clients.
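The paper does not publish code, but the pipeline described in the abstract (frozen pre-trained contextual embeddings, a stacked CNN feature extractor, then a classifier over many fine-grained label types) can be sketched as below. All dimensions, variable names, and the random stand-in for transformer outputs are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen pre-trained transformer outputs:
# a sentence of 8 tokens, each a 16-dim contextual embedding.
seq_len, emb_dim = 8, 16
token_embeddings = rng.normal(size=(seq_len, emb_dim))

def conv1d_relu(x, kernels, width=3):
    """Slide each (width x emb_dim) kernel over the token axis, then ReLU."""
    n_pos = x.shape[0] - width + 1
    out = np.empty((n_pos, kernels.shape[0]))
    for i in range(n_pos):
        window = x[i:i + width].ravel()             # (width * emb_dim,)
        out[i] = kernels.reshape(kernels.shape[0], -1) @ window
    return np.maximum(out, 0.0)

n_filters, n_classes = 4, 20                        # many fine-grained labels
kernels = rng.normal(size=(n_filters, 3, emb_dim)) * 0.1
W_out = rng.normal(size=(n_classes, n_filters)) * 0.1

features = conv1d_relu(token_embeddings, kernels)   # (6, 4) local n-gram features
pooled = features.max(axis=0)                       # max-over-time pooling
logits = W_out @ pooled                             # linear classification head
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over label types
print(probs.shape)
```

Only the CNN and classifier weights would be trained on the small labeled set; the embedding layer stays frozen, which is one common way transfer learning copes with limited domain data.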
Pages: 12
Related papers
50 records
  • [1] Information Extraction of Domain-specific Business Documents with Limited Data
    Minh-Tien Nguyen
    Le Thai Linh
    Dung Tien Le
    Nguyen Hong Son
    Do Hoang Thai Duong
    Bui Cong Minh
    Akira Shojiguchi
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [2] AURORA: An Information Extraction System of Domain-specific Business Documents with Limited Data
    Minh-Tien Nguyen
    Dung Tien Le
    Le Thai Linh
    Nguyen Hong Son
    Do Hoang Thai Duong
    Bui Cong Minh
    Nguyen Hai Phong
    Nguyen Huu Hiep
    CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, 2020, : 3437 - 3440
  • [3] Relation Identification in Business Rules for Domain-specific Documents
    Bhattacharyya, Abhidip
    Chittimalli, Pavan Kumar
    Naik, Ravindra
    ISEC'18: PROCEEDINGS OF THE 11TH INNOVATIONS IN SOFTWARE ENGINEERING CONFERENCE, 2018,
  • [4] Domain-specific information extraction structures
    Lyons, S
    Smith, D
    13TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2002, : 80 - 84
  • [5] Extraction of Informative Expressions from Domain-specific Documents
    Yamamoto, Eiko
    Isahara, Hitoshi
    Terada, Akira
    Abe, Yasunori
    SIXTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, LREC 2008, 2008, : 1611 - 1617
  • [6] Extracting Web Business Information Based on Domain-Specific Ontology
    Shen, J.
    Bi, L.
    Xu, F. Y.
    He, K.
    Wei, L. H.
    Zhu, Y.
    ITESS: 2008 PROCEEDINGS OF INFORMATION TECHNOLOGY AND ENVIRONMENTAL SYSTEM SCIENCES, PT 1, 2008, : 997 - 1003
  • [7] Prioritization of Domain-Specific Web Information Extraction
    Huang, Jian
    Yu, Cong
    PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 2010, : 1327 - 1333
  • [8] Automatic extraction of domain-specific stopwords from labeled documents
    Makrehchi, Masoud
    Kamel, Mohamed S.
    ADVANCES IN INFORMATION RETRIEVAL, 2008, 4956 : 222 - 233
  • [9] Term extraction from sparse, ungrammatical domain-specific documents
    Ittoo, Ashwin
    Bouma, Gosse
    EXPERT SYSTEMS WITH APPLICATIONS, 2013, 40 (07) : 2530 - 2540
  • [10] Domain Specific Transformers-Based Prioritization of Re-admission for Patients in Healthcare
    Hira, Naseem
    Cagatay, Catal
    PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2023, 2023, : 51 - 56