Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Cited by: 2
Authors
Pires, Ramon [1, 2]
de Souza, Fabio C. [1, 3]
Rosa, Guilherme [1, 3]
Lotufo, Roberto A. [1, 3]
Nogueira, Rodrigo [1, 3]
Affiliations
[1] NeuralMind Inteligencia Artificial, Sao Paulo, SP, Brazil
[2] Univ Estadual Campinas, Inst Comp, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP, Brazil
Source
DOCUMENT ANALYSIS SYSTEMS, DAS 2022 | 2022 / Vol. 13237
Keywords
Information extraction; Sequence-to-sequence; Legal texts
DOI
10.1007/978-3-031-06555-2_6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. The source code is available at https://github.com/neuralmind-ai/information-extraction-t5.
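The abstract mentions two ideas that a small example may make more concrete: generating the output already in a structured format, and aligning each extracted value with the input text for inspection. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The `field: value; field: value` output format, the function names `parse_structured_output` and `align_to_input`, and the sample strings are all hypothetical; the alignment here uses exact substring search with a longest-common-block fallback from Python's standard `difflib`, which may differ from the method proposed in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def parse_structured_output(generated: str) -> Dict[str, str]:
    """Parse a hypothetical 'field: value; field: value' string into a dict."""
    fields: Dict[str, str] = {}
    for chunk in generated.split(";"):
        if ":" not in chunk:
            continue  # skip fragments that carry no field name
        name, value = chunk.split(":", 1)
        fields[name.strip()] = value.strip()
    return fields


def align_to_input(value: str, source: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets in `source` that best match `value`.

    Tries an exact substring match first, then falls back to the longest
    contiguous block shared by both strings. Returns None if nothing matches.
    """
    start = source.find(value)
    if start != -1:
        return start, start + len(value)
    matcher = SequenceMatcher(None, source, value, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(value))
    if match.size == 0:
        return None
    return match.a, match.a + match.size


# Hypothetical input text and hypothetical model output.
source_text = "Registered owner: Maria da Silva, registration no. 12345."
model_output = "owner: Maria da Silva; registration_number: 12345"

extracted = parse_structured_output(model_output)
spans = {name: align_to_input(value, source_text) for name, value in extracted.items()}
```

Each entry in `spans` gives character offsets into `source_text`, so a reviewer could highlight where an extracted value came from when auditing the system.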
Pages: 83-95
Number of pages: 13