Segmenting Brazilian legislative text using weak supervision and active learning

Cited by: 0
Authors
Siqueira, Felipe A. [1]
Pressato, Diany [1]
Pereira, Fabiola S. F. [1, 2]
da Silva, Nadia F. F. [1, 3]
Souza, Ellen [1, 4]
Dias, Marcio S. [1, 5]
de Carvalho, Andre C. P. L. F. [1]
Affiliations
[1] Univ Sao Paulo, Inst Math Sci & Computat, Sao Carlos, SP, Brazil
[2] Univ Fed Uberlandia, Uberlandia, MG, Brazil
[3] Univ Fed Goias, Goiania, GO, Brazil
[4] Rural Fed Univ Pernambuco, Serra Talhada, PE, Brazil
[5] Fed Univ Catalao, Catalao, GO, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil
Keywords
Text segmentation; Legislative domain; Weak supervision; Active learning; Portuguese data
DOI
10.1007/s10506-024-09419-5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Legislative houses all over the world are adopting tools based on artificial intelligence to support their work. The incorporation of these tools can improve the analysis of the text of proposed laws and speed up their preparation and discussion. The performance of artificial intelligence tools for text processing tasks is largely affected by the corpora used, which should ideally be adapted for the specific domain. When dealing with legislative corpora, text segmentation is often necessary due to the distinct purposes of legislative segments within the overall bill structure. While rule-based approaches can be effective in cases where the data follows a consistent format, they fail when inconsistencies arise in the formatting of legislative bills. In this study, we extensively investigate the use of weak supervision and active learning to accurately segment over 100,000 Brazilian federal legislative bills using a sequence tagging approach. The experiments demonstrated that both BERT and LSTM models achieved high statistical performance without the limitations of rule-based systems. In segmenting long documents beyond the limited context window of BERT, we find that simple moving windows suffice because the required context for accurate legislative segmentation is mostly local. We also conducted an analysis of transfer learning from our monolingual models to French, Italian, German, and English (US) legislative texts. According to our experimental results, our models present non-trivial zero-shot and effective out-of-distribution fine-tuning performance, suggesting potential avenues for multilingual legislative segmentation without the need for computationally expensive models. The models, data, and code are publicly available at https://github.com/ulysses-camara/ulysses-segmenter.
Pages: 82