Segmenting Brazilian legislative text using weak supervision and active learning

Citations: 0
Authors
Siqueira, Felipe A. [1]
Pressato, Diany [1]
Pereira, Fabiola S. F. [1, 2]
da Silva, Nadia F. F. [1, 3]
Souza, Ellen [1, 4]
Dias, Marcio S. [1, 5]
de Carvalho, Andre C. P. L. F. [1]
Affiliations
[1] Univ Sao Paulo, Inst Math Sci & Computat, Sao Carlos, SP, Brazil
[2] Univ Fed Uberlandia, Uberlandia, MG, Brazil
[3] Univ Fed Goias, Goiania, GO, Brazil
[4] Rural Fed Univ Pernambuco, Serra Talhada, PE, Brazil
[5] Fed Univ Catalao, Catalao, GO, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Text segmentation; Legislative domain; Weak supervision; Active learning; Portuguese data;
DOI
10.1007/s10506-024-09419-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Legislative houses all over the world are adopting tools based on artificial intelligence to support their work. These tools can improve the analysis of the text of proposed laws and speed up their preparation and discussion. The performance of artificial intelligence tools on text processing tasks depends heavily on the corpora used, which should ideally be adapted to the specific domain. Legislative corpora often require text segmentation, because different segments serve distinct purposes within the overall structure of a bill. Rule-based approaches can be effective when the data follows a consistent format, but they fail when the formatting of legislative bills is inconsistent. In this study, we extensively investigate the use of weak supervision and active learning to accurately segment over 100,000 Brazilian federal legislative bills using a sequence tagging approach. The experiments demonstrated that both BERT and LSTM models achieved high statistical performance without the limitations of rule-based systems. For segmenting long documents that exceed the limited context window of BERT, we find that simple moving windows suffice, because the context required for accurate legislative segmentation is mostly local. We also analyzed transfer learning from our monolingual models to French, Italian, German, and English (US) legislative texts. According to our experimental results, our models achieve non-trivial zero-shot performance and fine-tune effectively on out-of-distribution data, suggesting potential avenues for multilingual legislative segmentation without the need for computationally expensive models. The models, data, and code are publicly available at https://github.com/ulysses-camara/ulysses-segmenter.
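The moving-window idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `moving_windows` and `tag_long_document`, the default window and stride sizes, and the rule for resolving overlaps (keep the label from the window in which the token sits farthest from an edge) are all assumptions made for illustration. The `tagger` argument stands in for any sequence tagger, such as a BERT or LSTM model, that returns one label per input token.

```python
from typing import Callable, List, Optional, Sequence


def moving_windows(n_tokens: int, window_size: int, stride: int) -> List[range]:
    """Return overlapping index ranges that together cover [0, n_tokens)."""
    if window_size <= 0 or stride <= 0 or stride > window_size:
        raise ValueError("need 0 < stride <= window_size")
    if n_tokens == 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_tokens)
        windows.append(range(start, end))
        if end >= n_tokens:
            break
        start += stride
    return windows


def tag_long_document(
    tokens: Sequence[str],
    tagger: Callable[[List[str]], List[int]],  # hypothetical per-window tagger
    window_size: int = 512,  # assumed value, chosen to mirror a typical BERT limit
    stride: int = 256,       # assumed value, gives 50% overlap
) -> List[Optional[int]]:
    """Tag each window independently and stitch the labels back together.

    For a token covered by several windows, keep the label predicted in the
    window where the token is farthest from the window edges (assumed rule).
    """
    labels: List[Optional[int]] = [None] * len(tokens)
    best_margin = [-1] * len(tokens)
    for window in moving_windows(len(tokens), window_size, stride):
        chunk = [tokens[i] for i in window]
        chunk_labels = tagger(chunk)
        for offset, i in enumerate(window):
            # Distance of this token from the nearest edge of the window.
            margin = min(offset, len(window) - 1 - offset)
            if margin > best_margin[i]:
                best_margin[i] = margin
                labels[i] = chunk_labels[offset]
    return labels
```

Because each window is tagged on its own, this scheme only works well when the evidence for a segment boundary lies close to the boundary itself, which is the locality property the abstract reports for legislative segmentation.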
Pages: 82
Related papers
50 in total
  • [21] Weak Supervision for Learning Discourse Structure
    Badene, Sonia
    Thompson, Kate
    Lorre, Jean-Pierre
    Asher, Nicholas
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2296 - 2305
  • [22] Learning Node Abnormality with Weak Supervision
    Zhou, Qinghai
    Ding, Kaize
    Liu, Huan
    Tong, Hanghang
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 3584 - 3594
  • [23] Merging Weak and Active Supervision for Semantic Parsing
    Ni, Ansong
    Yin, Pengcheng
    Neubig, Graham
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 8536 - 8543
  • [24] Active neural learners for text with dual supervision
    Shama Sastry, Chandramouli
    Milios, Evangelos E.
    NEURAL COMPUTING & APPLICATIONS, 2020, 32 (17): : 13343 - 13362
  • [25] Active neural learners for text with dual supervision
    Chandramouli Shama Sastry
    Evangelos E. Milios
    Neural Computing and Applications, 2020, 32 : 13343 - 13362
  • [26] Acoustic Inspection of Concrete Structures Using Active Weak Supervision and Visual Information
    Kasahara, Jun Younes Louhi
    Yamashita, Atsushi
    Asama, Hajime
    SENSORS, 2020, 20 (03)
  • [27] GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text
    Steinwand, Sandro
    Borchert, Florian
    Winkler, Silvia
    Schapranow, Matthieu-P
    ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2023, 2023, 13897 : 183 - 192
  • [28] META: Metadata-Empowered Weak Supervision for Text Classification
    Mekala, Dheeraj
    Zhang, Xinyang
    Shang, Jingbo
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 8351 - 8361
  • [29] X-Class: Text Classification with Extremely Weak Supervision
    Wang, Zihan
    Mekala, Dheeraj
    Shang, Jingbo
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3043 - 3053