Extracting enhanced artificial intelligence model metadata from software repositories

Cited by: 2
Authors
Tsay, Jason [1]
Braz, Alan [2]
Hirzel, Martin [1]
Shinnar, Avraham [1]
Mummert, Todd [1]
Affiliations
[1] IBM Res, Yorktown Hts, NY 10598 USA
[2] IBM Res Brazil, Sao Paulo, Brazil
Keywords
Artificial intelligence; Machine learning; Mining software repositories; Model mining; Model metadata; Model catalog; Metadata extraction
DOI
10.1007/s10664-022-10206-6
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
While artificial intelligence (AI) models have improved at understanding large-scale data, understanding AI models themselves at any scale is difficult. For example, even two models that implement the same network architecture may differ in frameworks, datasets, or even domains. Furthermore, attempting to use either model often requires much manual effort to understand it. As software engineering and AI development share many of the same languages and tools, techniques in mining software repositories should enable more scalable insights into AI models and AI development. However, much of the relevant metadata around models are not easily extractable. This paper (an extension of our MSR 2020 paper) presents a library called AIMMX for AI Model Metadata eXtraction from software repositories into enhanced metadata that conforms to a flexible metadata schema. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. We also explored how AIMMX can enable studies and tools to advance engineering support for AI development. As preliminary examples, we present an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. We also demonstrate the flexibility of extracted metadata by using the evaluation dataset in an existing natural language processing (NLP) analysis platform to identify trends in the dataset. Overall, we hope AIMMX fosters research towards better AI development.
Pages: 37