Construction and Evaluation of a High-Quality Corpus for Legal Intelligence Using Semiautomated Approaches

Cited by: 14
Authors
Chen, Haihua [1]
Pieptea, Lavinia F. [2]
Ding, Junhua [1]
Affiliations
[1] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[2] Univ North Texas, Dept Math, Denton, TX 76203 USA
Funding
US National Science Foundation
Keywords
Law; Annotations; Data integrity; Machine learning; Data mining; Task analysis; Deep learning; BERT; data augmentation; data quality; deep learning; expectation-maximization (EM); generative adversarial network (GAN); legal argument; legal artificial intelligence (legal AI); machine learning corpus; CLASSIFICATION; ARGUMENTATION; DOCUMENTS;
DOI
10.1109/TR.2022.3156126
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A high-quality corpus is essential for building an effective legal intelligence system. The quality of a corpus includes both the quality of the original data and the quality of its labeling. The major quality dimensions of a legal corpus are comprehensiveness, freshness, and correctness. However, building a comprehensive, correct, and fresh legal corpus is a grand challenge. In this article, we propose a semiautomated machine learning framework to address this challenge. We first created an initial corpus of 4937 manually labeled instances and implemented several strategies to assure its quality. The initial results showed that class imbalance and insufficient training data were the two major quality issues that degraded the system built on the data. We compared three class-imbalance-handling techniques and found that the mixed-sampling method, which combines upsampling and downsampling, was the most effective. To address the insufficiency of training data, we experimented with several machine learning methods for automated data augmentation, including pseudolabeling, co-training, expectation-maximization (EM), and generative adversarial networks (GANs). The results showed that GAN with deep learning models achieved the best performance. Finally, we proposed and tested ensemble learning over different classifiers for constructing a legal corpus, which achieves higher quality in comprehensiveness, freshness, and correctness than existing work. The semiautomated machine learning framework and the data quality evaluation method developed in this research can be used for data augmentation and quality evaluation of large datasets, and as a reference for selecting machine learning methods for data augmentation and generation.
The machine learning models, the training data, and the legal corpus are publicly available at https://github.com/haihua0913/legalArgumentmining.
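The abstract describes mixed sampling only as a combination of upsampling and downsampling. As an illustration of that general idea, and not the authors' implementation, the following minimal sketch resamples every class to a common size. The function name `mixed_sample`, the default target (the mean class size), and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def mixed_sample(instances, labels, target=None, seed=0):
    """Resample so that every class ends up with `target` instances.

    Classes with at least `target` instances are downsampled without
    replacement; smaller classes are upsampled by adding random duplicates.
    `target` defaults to the mean class size (an assumption of this sketch).
    """
    rng = random.Random(seed)

    # Group instances by their class label.
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)

    if target is None:
        target = round(len(instances) / len(by_class))

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Downsample: keep a random subset of the majority class.
            chosen = rng.sample(xs, target)
        else:
            # Upsample: keep all originals and add random duplicates.
            chosen = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy imbalanced data: 8 instances of class "a", 2 of "b", 2 of "c".
toy_x = list(range(12))
toy_y = ["a"] * 8 + ["b"] * 2 + ["c"] * 2
res_x, res_y = mixed_sample(toy_x, toy_y)
print(Counter(res_y))
```

With the toy data, the mean class size is 4, so class "a" is reduced to 4 instances while "b" and "c" are each raised to 4. In practice the resampling would be applied only to the training split, so that duplicated instances do not leak into evaluation data.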
Pages: 657-673
Number of pages: 17