Construction and Evaluation of a High-Quality Corpus for Legal Intelligence Using Semiautomated Approaches

Citations: 14
Authors
Chen, Haihua [1]
Pieptea, Lavinia F. [2]
Ding, Junhua [1]
Affiliations
[1] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[2] Univ North Texas, Dept Math, Denton, TX 76203 USA
Funding
U.S. National Science Foundation
Keywords
Law; Annotations; Data integrity; Machine learning; Data mining; Task analysis; Deep learning; BERT; data augmentation; data quality; expectation-maximization (EM); generative adversarial network (GAN); legal argument; legal artificial intelligence (legal AI); machine learning corpus; CLASSIFICATION; ARGUMENTATION; DOCUMENTS
DOI
10.1109/TR.2022.3156126
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A high-quality corpus is essential for building an effective legal intelligence system. The quality of a corpus includes both the quality of the original data and the quality of its corresponding labeling. The major quality dimensions of a legal corpus include comprehensiveness, freshness, and correctness. However, building a comprehensive, correct, and fresh legal corpus is a grand challenge. In this article, we propose a semiautomated machine learning framework to address the challenge. We first created an initial corpus with 4937 instances that were manually labeled. Several strategies were implemented to assure its quality. The initial results showed that class imbalance and insufficiency of training data were the two major issues that negatively affected the quality of the system built on the data. We experimented with and compared three class-imbalance-handling techniques and found that the mixed-sampling method, which combines upsampling and downsampling, was the most effective way to address the issue. To address the insufficiency of training data, we experimented with several machine learning methods for automated data augmentation, including pseudolabeling, co-training, expectation-maximization, and generative adversarial network (GAN). The results showed that GAN with deep learning models achieved the best performance. Finally, ensemble learning of different classifiers was proposed and tested for the construction of a legal corpus, which achieves higher quality in comprehensiveness, freshness, and correctness compared to existing work. The semiautomated machine learning framework and the data quality evaluation method developed in this research can be used for data augmentation and quality evaluation of a large dataset, as well as a reference for the selection of machine learning methods for data augmentation and generation.
The machine learning models, the training data, and the legal corpus are publicly accessible at https://github.com/haihua0913/legalArgumentmining.
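The abstract does not say how the mixed-sampling method was implemented. As a rough illustration only, the sketch below assumes plain random resampling toward a fixed per-class target: classes above the target are downsampled without replacement, and classes below it keep all originals and are upsampled with replacement. The function name `mixed_sample`, the `target_per_class` parameter, and the toy labels are hypothetical and are not taken from the article or its repository.

```python
import random
from collections import Counter, defaultdict


def mixed_sample(instances, labels, target_per_class, seed=0):
    """Resample so that every class ends up with target_per_class instances.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group instances by their class label.
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)

    out_x, out_y = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            # Majority class: downsample without replacement.
            chosen = rng.sample(items, target_per_class)
        else:
            # Minority class: keep all originals, then upsample
            # the shortfall with replacement.
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_x.extend(chosen)
        out_y.extend([label] * target_per_class)
    return out_x, out_y


if __name__ == "__main__":
    # Toy data: 9 "major" sentences and 3 "minor" sentences (made-up labels).
    sentences = [f"sentence {i}" for i in range(12)]
    classes = ["major"] * 9 + ["minor"] * 3
    new_x, new_y = mixed_sample(sentences, classes, target_per_class=6)
    print(Counter(new_y))
```

Choosing a target between the smallest and largest class sizes is what makes the sampling "mixed": it avoids discarding most of the majority class while limiting how often minority instances are duplicated.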
Pages: 657-673
Number of pages: 17