Construction and Evaluation of a High-Quality Corpus for Legal Intelligence Using Semiautomated Approaches

Cited by: 14
Authors
Chen, Haihua [1]
Pieptea, Lavinia F. [2]
Ding, Junhua [1]
Affiliations
[1] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[2] Univ North Texas, Dept Math, Denton, TX 76203 USA
Funding
U.S. National Science Foundation
Keywords
Law; Annotations; Data integrity; Machine learning; Data mining; Task analysis; Deep learning; BERT; data augmentation; data quality; deep learning; expectation-maximization (EM); generative adversarial network (GAN); legal argument; legal artificial intelligence (legal AI); machine learning corpus; CLASSIFICATION; ARGUMENTATION; DOCUMENTS;
DOI
10.1109/TR.2022.3156126
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A high-quality corpus is essential for building an effective legal intelligence system. The quality of a corpus includes both the quality of the original data and the quality of its corresponding labeling. The major quality dimensions of a legal corpus include comprehensiveness, freshness, and correctness. However, building a comprehensive, correct, and fresh legal corpus is a grand challenge. In this article, we propose a semiautomated machine learning framework to address the challenge. We first created an initial corpus of 4937 manually labeled instances and implemented several strategies to assure its quality. The initial results showed that class imbalance and insufficient training data were the two major quality issues that negatively impacted the system built on the data. We experimented with and compared three class-imbalance-handling techniques and found that the mixed-sampling method, which combines upsampling and downsampling, was the most effective way to address the issue. To address the insufficiency of training data, we experimented with several machine learning methods for automated data augmentation, including pseudolabeling, co-training, expectation-maximization (EM), and generative adversarial network (GAN). The results showed that GAN with deep learning models achieved the best performance. Finally, ensemble learning of different classifiers was proposed and tested for the construction of a legal corpus, which achieves higher quality in comprehensiveness, freshness, and correctness than existing work. The semiautomated machine learning framework and the data quality evaluation method developed in this research can be used for data augmentation and quality evaluation of a large dataset, as well as a reference for selecting machine learning methods for data augmentation and generation.
The machine learning models, the training data, and the legal corpus are publicly accessible at https://github.com/haihua0913/legalArgumentmining.
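The abstract describes mixed sampling only as a combination of upsampling and downsampling. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `mixed_sample`, the default target (the mean class size), and the labels in the usage note are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def mixed_sample(examples, labels, target=None, seed=0):
    """Balance classes by downsampling large ones and upsampling small ones.

    Hypothetical helper for illustration; the paper's exact procedure
    is not given in the abstract.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)

    # Assumption: aim for the mean class size unless a target is given.
    if target is None:
        target = round(len(examples) / len(by_label))

    balanced = []
    for label, items in by_label.items():
        if len(items) > target:
            # Downsample: keep a random subset, without replacement.
            chosen = rng.sample(items, target)
        else:
            # Upsample: keep every original and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        balanced.extend((example, label) for example in chosen)

    rng.shuffle(balanced)
    return balanced
```

For instance, calling `mixed_sample(sentences, sentence_labels)` on a corpus with nine sentences of one class and three of another would return six of each, with all three minority sentences retained. Duplicating minority examples is the simplest form of upsampling; synthetic generation would be a different technique.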
Pages: 657-673
Number of pages: 17