Text mining of hypereutectic Al-Si alloys literature based on active learning

Cited by: 6
Authors
Liu, Yingli [1, 2]
Yao, Changhui [1, 2]
Niu, Chen [1]
Li, Wuliang [1, 2]
Yin, Jiancheng [3]
Shen, Tao [1, 2]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650093, Yunnan, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Comp Technol Applicat, Kunming 650500, Yunnan, Peoples R China
[3] Kunming Univ Sci & Technol, Fac Mat Sci & Engn, Kunming 650093, Yunnan, Peoples R China
Source
MATERIALS TODAY COMMUNICATIONS | 2021 / Vol. 26
Funding
National Natural Science Foundation of China;
Keywords
Materials Genome Initiative (MGI); Hypereutectic Al-Si alloy entity dataset (HASE); Material entities recognition (MER); Active learning;
DOI
10.1016/j.mtcomm.2021.102032
Chinese Library Classification (CLC) number
T [Industrial Technology];
Discipline classification code
08;
Abstract
Data-driven modeling is central to the Materials Genome Initiative (MGI), but quickly obtaining large amounts of material data remains a critical unresolved problem. Material databases are currently rarely shared, so useful material data are hard to obtain from public resources. We therefore use text mining to extract valid data from the literature on hypereutectic Al-Si alloys. Natural language processing (NLP) is a commonly used text mining approach, and named entity recognition (NER), one of its main tasks, can effectively extract information from the literature. However, no public dataset suitable for material entities recognition (MER) research exists in the materials field. To apply NER effectively to this field, we select five types of entities from the materials literature and construct the hypereutectic Al-Si alloy material entity dataset (HASE) by manual annotation; it contains 8,845 material entities in total. Because only a small amount of annotated data is available in the materials field, we also propose an MER method combined with active learning. Tailored to the characteristics of material entities, the active learning procedure uses automatic annotation based on a dictionary and rules, a CRF model, and a BiGRU-CRF model. In the end, a total of 16,677 material entities were annotated. Combining MER with active learning both improves the performance of the MER model and reduces the cost of annotation, and the method extracts useful material data from the literature more accurately. The result gives MGI researchers an effective way to quickly obtain large amounts of material data and has both theoretical significance and practical application value.
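The abstract names dictionary-and-rule-based automatic annotation as one component of the active learning procedure but gives no implementation details. The following is a minimal Python sketch of what such a step might look like, under stated assumptions: the dictionary entries, the entity labels (MATERIAL, PHASE, PROCESS, VALUE), the unit rule, and the least-confidence selection function are all hypothetical illustrations and are not taken from the paper or from HASE.

```python
import re
from typing import Dict, List

# Hypothetical dictionary mapping surface forms to entity types.
# The labels are illustrative; the five HASE entity types are not listed in the abstract.
ENTITY_DICT: Dict[str, str] = {
    "hypereutectic al-si alloy": "MATERIAL",
    "primary si": "PHASE",
    "spray deposition": "PROCESS",
}

# Hypothetical rule: a number directly followed by a unit is tagged as a value.
VALUE_RULE = re.compile(r"^\d+(\.\d+)?(MPa|GPa|wt%|°C)$")


def auto_annotate(tokens: List[str]) -> List[str]:
    """Assign BIO tags using dictionary lookup first, then the unit rule."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer dictionary entries first so they win over shorter overlaps.
    entries = sorted(
        ((key.split(), label) for key, label in ENTITY_DICT.items()),
        key=lambda entry: -len(entry[0]),
    )
    for words, label in entries:
        n = len(words)
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and span_free:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    # Apply the rule only to tokens the dictionary left untagged.
    for i, tok in enumerate(tokens):
        if tags[i] == "O" and VALUE_RULE.match(tok):
            tags[i] = "B-VALUE"
    return tags


def select_for_annotation(confidences: Dict[int, float], k: int) -> List[int]:
    """Least-confidence sampling: return the ids of the k sentences whose
    model confidence is lowest, as candidates for manual annotation."""
    return sorted(confidences, key=confidences.get)[:k]


if __name__ == "__main__":
    sentence = "The hypereutectic Al-Si alloy reached 250MPa after spray deposition"
    tokens = sentence.split()
    for token, tag in zip(tokens, auto_annotate(tokens)):
        print(token, tag)
    # Confidence scores would come from a trained tagger such as a CRF;
    # the numbers here are made up for illustration.
    print(select_for_annotation({0: 0.9, 1: 0.4, 2: 0.7}, k=2))
```

In a pipeline of this kind, the automatically tagged sentences could serve as initial training data for a sequence model, while the low-confidence sentences returned by the selection step would be routed to human annotators. Whether the paper uses least-confidence sampling or another selection criterion is not stated in the abstract.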
Pages: 8