Segmenting Brazilian legislative text using weak supervision and active learning

Cited by: 0
Authors
Siqueira, Felipe A. [1]
Pressato, Diany [1]
Pereira, Fabiola S. F. [1, 2]
da Silva, Nadia F. F. [1, 3]
Souza, Ellen [1, 4]
Dias, Marcio S. [1, 5]
de Carvalho, Andre C. P. L. F. [1]
Affiliations
[1] Univ Sao Paulo, Inst Math Sci & Computat, Sao Carlos, SP, Brazil
[2] Univ Fed Uberlandia, Uberlandia, MG, Brazil
[3] Univ Fed Goias, Goiania, GO, Brazil
[4] Rural Fed Univ Pernambuco, Serra Talhada, PE, Brazil
[5] Fed Univ Catalao, Catalao, GO, Brazil
Funding
São Paulo Research Foundation, Brazil;
Keywords
Text segmentation; Legislative domain; Weak supervision; Active learning; Portuguese data;
DOI
10.1007/s10506-024-09419-5
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Legislative houses all over the world are adopting tools based on artificial intelligence to support their work. These tools can improve the analysis of proposed bills and speed up the preparation and discussion of new laws. The performance of artificial intelligence tools for text processing tasks is largely affected by the corpora used, which should ideally be adapted to the specific domain. When dealing with legislative corpora, text segmentation is often necessary because legislative segments serve distinct purposes within the overall bill structure. While rule-based approaches can be effective when the data follows a consistent format, they fail when inconsistencies arise in the formatting of legislative bills. In this study, we extensively investigate the use of weak supervision and active learning to accurately segment over 100,000 Brazilian federal legislative bills using a sequence tagging approach. The experiments demonstrated that both BERT and LSTM models achieved high statistical performance without the limitations of rule-based systems. In segmenting long documents beyond the limited context window of BERT, we find that simple moving windows suffice because the required context for accurate legislative segmentation is mostly local. We also conducted an analysis of transfer learning from our monolingual models to French, Italian, German, and English (US) legislative texts. According to our experimental results, our models present non-trivial zero-shot performance and are effective when fine-tuned on out-of-distribution data, suggesting potential avenues for multilingual legislative segmentation without the need for computationally expensive models. The models, data, and code are publicly available at https://github.com/ulysses-camara/ulysses-segmenter.
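The moving-window idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`moving_windows`, `tag_long_document`, `toy_tagger`), the window and stride values, and the rule of keeping each token's label from the window where it sits farthest from an edge are all assumptions made for illustration. The tagger is a stand-in for any model with a bounded context, such as BERT.

```python
from typing import Callable, List, Sequence


def moving_windows(n_tokens: int, window: int, stride: int) -> List[range]:
    """Return overlapping index ranges that together cover 0..n_tokens-1."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if n_tokens == 0:
        return []
    starts = list(range(0, max(n_tokens - window, 0) + 1, stride))
    # Add a final window if the regular stride left the tail uncovered.
    if starts[-1] + window < n_tokens:
        starts.append(n_tokens - window)
    return [range(s, min(s + window, n_tokens)) for s in starts]


def tag_long_document(
    tokens: Sequence[str],
    predict_window: Callable[[List[str]], List[int]],
    window: int = 512,
    stride: int = 256,
) -> List[int]:
    """Tag every token by running a bounded-context tagger over moving windows.

    When windows overlap, a token keeps the label predicted in the window
    where it is farthest from either edge (an assumed merging rule).
    """
    labels = [0] * len(tokens)
    best_margin = [-1] * len(tokens)
    for idx in moving_windows(len(tokens), window, stride):
        preds = predict_window([tokens[i] for i in idx])
        for offset, i in enumerate(idx):
            margin = min(offset, len(idx) - 1 - offset)
            if margin > best_margin[i]:
                best_margin[i] = margin
                labels[i] = preds[offset]
    return labels


def toy_tagger(window_tokens: List[str]) -> List[int]:
    """Hypothetical stand-in for a model: 1 marks the start of a segment."""
    return [1 if tok == "Art." else 0 for tok in window_tokens]


tokens = ["Art.", "1", "texto", "Art.", "2", "texto", "texto"]
segment_starts = tag_long_document(tokens, toy_tagger, window=4, stride=2)
```

Preferring the most central prediction reflects the abstract's observation that the required context is mostly local: a token near a window edge sees less of its neighbourhood than the same token near the middle of another window.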
Pages: 82
Related papers
50 in total
  • [31] Learning Dependency Structures for Weak Supervision Models
    Varma, Paroma
    Sala, Frederic
    He, Ann
    Ratner, Alexander
    Re, Christopher
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [32] Rethinking Weak Supervision in Helping Contrastive Learning
    Cui, Jingyi
    Huang, Weiran
    Wang, Yifei
    Wang, Yisen
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [33] Learning with Weak Supervision for Email Intent Detection
    Shu, Kai
    Mukherjee, Subhabrata
    Zheng, Guoqing
    Awadallah, Ahmed Hassan
    Shokouhi, Milad
    Dumais, Susan
    PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, : 1051 - 1060
  • [34] Learning Transformation Invariant Representations with Weak Supervision
    Coors, Benjamin
    Condurache, Alexandru
    Mertins, Alfred
    Geiger, Andreas
    PROCEEDINGS OF THE 13TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISIGRAPP 2018), VOL 5: VISAPP, 2018, : 64 - 72
  • [35] Improving the performance of weak supervision searches using transfer and meta-learning
    Beauchesne, Hugues
    Chen, Zong-En
    Chiang, Cheng-Wei
    JOURNAL OF HIGH ENERGY PHYSICS, 2024, 2024 (02)
  • [36] LITE: Intent-based Task Representation Learning Using Weak Supervision
    Otani, Naoki
    Gamon, Michael
    Jauhar, Sujay Kumar
    Yang, Mei
    Malireddi, Sri Raghu
    Riva, Oriana
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2410 - 2424
  • [37] Deep learning of cuneiform sign detection with weak supervision using transliteration alignment
    Dencker, Tobias
    Klinkisch, Pablo
    Maul, Stefan M.
    Ommer, Bjoern
    PLOS ONE, 2020, 15 (12):
  • [38] Using Active Learning in Text Classification of Quranic Sciences
    Goudjil, Mohamed
    Bedda, Mouldi
    Koudil, Mouloud
    Ghoggali, Noureddine
    2013 TAIBAH UNIVERSITY INTERNATIONAL CONFERENCE ON ADVANCES IN INFORMATION TECHNOLOGY FOR THE HOLY QURAN AND ITS SCIENCES, 2013, : 209 - 213
  • [39] Segmenting handwritten text using supervised classification techniques
    Sun, Y
    Butler, TS
    Shafarenko, A
    Adams, R
    Loomes, M
    Davey, N
    2004 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, PROCEEDINGS, 2004, : 657 - 662
  • [40] On using partial supervision for text categorization
    Aggarwal, CC
    Gates, SC
    Yu, PS
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (02) : 245 - 255