Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Cited by: 2

Authors:

Pires, Ramon [1,2]

de Souza, Fabio C. [1,3]

Rosa, Guilherme [1,3]

Lotufo, Roberto A. [1,3]

Nogueira, Rodrigo [1,3]

Affiliations:

[1] NeuralMind Inteligencia Artificial, Sao Paulo, SP, Brazil

[2] Univ Estadual Campinas, Inst Comp, Campinas, SP, Brazil

[3] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP, Brazil

Source:

DOCUMENT ANALYSIS SYSTEMS, DAS 2022 | 2022 / Volume 13237

Keywords:

Information extraction; Sequence-to-sequence; Legal texts;

DOI:

10.1007/978-3-031-06555-2_6

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction from legal and registration documents. We fine-tune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. The source code is available at https://github.com/neuralmind-ai/information-extraction-t5.
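
As a rough illustration of what it can mean to generate structured output and then align it with the input text, the sketch below parses a generated string into fields and looks up each value in the source document. The `[field]: value` output format, the field names, the sample text, and the helpers `parse_output` and `align` are all assumptions made for this example; they are not taken from the paper or its repository, and the alignment method the authors propose may work quite differently.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical output format: "[field]: value [field]: value ...".
# A value runs until the next "[field]:" marker or the end of the string.
FIELD_PATTERN = re.compile(
    r"\[(?P<name>[^\]]+)\]:\s*(?P<value>.*?)(?=\s*\[[^\]]+\]:|$)"
)


def parse_output(generated: str) -> Dict[str, str]:
    """Turn a generated string into a {field name: value} dictionary."""
    return {
        m.group("name").strip(): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(generated)
    }


def align(value: str, source: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of the first case-insensitive
    occurrence of value in source, or None when it does not occur verbatim."""
    if not value:
        return None
    match = re.search(re.escape(value), source, flags=re.IGNORECASE)
    return match.span() if match else None


# Illustrative data only: neither string comes from the paper's datasets.
source = "The property is located in Campinas, state of SP."
generated = "[city]: Campinas [state]: SP [owner]: Maria"

fields = parse_output(generated)
spans = {name: align(value, source) for name, value in fields.items()}

for name, value in fields.items():
    status = spans[name] if spans[name] is not None else "not found in input"
    print(f"{name}: {value!r} -> {status}")
```

A value that cannot be located, such as the made-up `owner` field here, comes back as `None`, which is the kind of signal a reviewer could use to flag an output for inspection. Exact substring matching is the simplest possible choice and would miss values that the model normalizes (for example, reformatted dates).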

Pages: 83-95

Page count: 13
