Curated Data In - Trustworthy In Silico Models Out: The Impact of Data Quality on the Reliability of Artificial Intelligence Models as Alternatives to Animal Testing

Cited by: 27
Authors
Alves, Vinicius M. [1]
Auerbach, Scott S. [2]
Kleinstreuer, Nicole [3]
Rooney, John P. [4]
Muratov, Eugene N. [5,6]
Rusyn, Ivan [7]
Tropsha, Alexander [5]
Schmitt, Charles [1]
Affiliations
[1] NIEHS, Off Data Sci, Div Natl Toxicol Program DNTP, Durham, NC 27560 USA
[2] NIEHS, Toxinformat Grp, Predict Toxicol Branch, DNTP, Durham, NC 27560 USA
[3] NIEHS, Natl Toxicol Program Interagcy Ctr Evaluat Altern, Sci Directors Off, DNTP, Durham, NC 27560 USA
[4] Integrated Lab Syst LLC, Morrisville, NC USA
[5] Univ N Carolina, UNC Eshelman Sch Pharm, Lab Mol Modeling, Chapel Hill, NC 27599 USA
[6] Univ Fed Paraiba, Dept Pharmaceut Sci, Joao Pessoa, Paraiba, Brazil
[7] Texas A&M Univ, Coll Vet Med & Biomed Sci, Dept Vet Integrat Biosci, College Stn, TX USA
Source
ATLA-ALTERNATIVES TO LABORATORY ANIMALS | 2021 / Vol. 49 / Issue 03
Keywords
artificial intelligence; data curation; data quality; data reproducibility; QSAR; PREDICTION; REPRODUCIBILITY; TOXICOLOGY; TOXICITY; STRATEGY; VERIFY; BEWARE; CHEMBL; TRUST
DOI
10.1177/02611929211029635
Chinese Library Classification
R-3 [Medical research methods]; R3 [Basic medicine]
Subject classification code
1001
Abstract
New Approach Methodologies (NAMs) that employ artificial intelligence (AI) for predicting adverse effects of chemicals have generated optimistic expectations as alternatives to animal testing. However, the major underappreciated challenge in developing robust and predictive AI models is the impact of the quality of the input data on the model accuracy. Indeed, poor data reproducibility and quality have been frequently cited as factors contributing to the crisis in biomedical research, as well as similar shortcomings in the fields of toxicology and chemistry. In this article, we review the most recent efforts to improve confidence in the robustness of toxicological data and investigate the impact that data curation has on the confidence in model predictions. We also present two case studies demonstrating the effect of data curation on the performance of AI models for predicting skin sensitisation and skin irritation. We show that, whereas models generated with uncurated data had a 7-24% higher correct classification rate (CCR), the perceived performance was, in fact, inflated owing to the high number of duplicates in the training set. We assert that data curation is a critical step in building computational models, to help ensure that reliable predictions of chemical toxicity are achieved through use of the models.
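The inflation effect described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in this record: CCR is taken here as the mean of sensitivity and specificity (a common convention in the QSAR literature; the abstract does not give the formula), each record is a hypothetical `(structure_key, label)` pair with binary labels, and keys with conflicting labels are dropped entirely, which is only one possible curation policy. The function names are illustrative and are not taken from the article.

```python
from collections import defaultdict


def ccr(y_true, y_pred):
    """Correct classification rate, assumed here to be the mean of
    sensitivity and specificity for binary labels (1 = toxic, 0 = non-toxic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("CCR needs at least one positive and one negative")
    return 0.5 * (tp / pos + tn / neg)


def deduplicate(records):
    """Collapse records that share a structure key.

    Keeps one record per key when all of its labels agree and drops the
    key entirely when labels conflict (an assumed curation policy).
    """
    labels_by_key = defaultdict(set)
    for key, label in records:
        labels_by_key[key].add(label)
    return [
        (key, next(iter(labels)))
        for key, labels in labels_by_key.items()
        if len(labels) == 1
    ]


def leaked_keys(train, test):
    """Structure keys that occur in both the training and the test set."""
    return {key for key, _ in train} & {key for key, _ in test}


# Hypothetical records: one exact duplicate and one discordant pair.
raw = [("CCO", 0), ("CCO", 0), ("c1ccccc1", 1), ("CC=O", 1), ("CC=O", 0)]
curated = deduplicate(raw)
```

If the raw records were split into training and test sets before deduplication, `leaked_keys` would flag compounds the model has effectively already seen, which is one way the perceived CCR of a model built on uncurated data can end up higher than its performance on genuinely new chemicals.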
Pages: 73-82
Page count: 10
相关论文
共 84 条
  • [51] ChEMBL: towards direct deposition of bioassay data
    Mendez, David
    Gaulton, Anna
    Bento, A. Patricia
    Chambers, Jon
    De Veij, Marleen
    Felix, Eloy
    Magarinos, Maria Paula
    Mosquera, Juan F.
    Mutowo, Prudence
    Nowotka, Michal
    Gordillo-Maranon, Maria
    Hunter, Fiona
    Junco, Laura
    Mugumbate, Grace
    Rodriguez-Lopez, Milagros
    Atkinson, Francis
    Bosc, Nicolas
    Radoux, ChrisJ
    Segura-Cabrera, Aldo
    Hersey, Anne
    Leach, Andrew R.
    [J]. NUCLEIC ACIDS RESEARCH, 2019, 47 (D1) : D930 - D940
  • [52] Editorial: Method and Data Sharing and Reproducibility of Scientific Results
    Merz, Kenneth M., Jr.
    Amaro, Rommie
    Cournia, Zoe
    Rarey, Matthias
    Soares, Thereza
    Tropsha, Alexander
    Wahab, Habibah A.
    Wang, Renxiao
    [J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2020, 60 (12) : 5868 - 5869
  • [53] Untitled
    Merz, Kenneth M., Jr.
    Rarey, Matthias
    Tropsha, Alexander
    Wahab, Habibah A.
    [J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2015, 55 (04) : 719 - 720
  • [54] Improving Reproducibility in Toxicology
    Miller, Gary W.
    [J]. TOXICOLOGICAL SCIENCES, 2014, 139 (01) : 1 - 3
  • [55] Muratov EN, 2020, CHEM SOC REV, V49, P3525, DOI 10.1039/d0cs00098a
  • [56] Existing and Developing Approaches for QSAR Analysis of Mixtures
    Muratov, Eugene N.
    Varlamova, Ekaterina V.
    Artemenko, Anatoly G.
    Polishchuk, Pavel G.
    Kuz'min, Victor E.
    [J]. MOLECULAR INFORMATICS, 2012, 31 (3-4) : 202 - 221
  • [57] In silico toxicology protocols
    Myatt, Glenn J.
    Ahlberg, Ernst
    Akahori, Yumi
    Allen, David
    Amberg, Alexander
    Anger, Lennart T.
    Aptula, Aynur
    Auerbach, Scott
    Beilke, Lisa
    Bellion, Phillip
    Benigni, Romualdo
    Bercu, Joel
    Booth, Ewan D.
    Bower, Dave
    Brigo, Alessandro
    Burden, Natalie
    Cammerer, Zoryana
    Cronin, Mark T. D.
    Cross, Kevin P.
    Custer, Laura
    Dettwiler, Magdalena
    Dobo, Krista
    Ford, Kevin A.
    Fortin, Marie C.
    Gad-McDonald, Samantha E.
    Gellatly, Nichola
    Gervais, Veronique
    Glover, Kyle P.
    Glowienke, Susanne
    Van Gompel, Jacky
    Gutsell, Steve
    Hardy, Barry
    Harvey, James S.
    Hillegass, Jedd
    Honma, Masamitsu
    Hsieh, Jui-Hua
    Hsu, Chia-Wen
    Hughes, Kathy
    Johnson, Candice
    Jolly, Robert
    Jones, David
    Kemper, Ray
    Kenyon, Michelle O.
    Kim, Marlene T.
    Kruhlak, Naomi L.
    Kulkarni, Sunil A.
    Kuemmerer, Klaus
    Leavitt, Penny
    Majer, Bernhard
    Masten, Scott
    [J]. REGULATORY TOXICOLOGY AND PHARMACOLOGY, 2018, 96 : 1 - 17
  • [58] Automated Framework for Developing Predictive Machine Learning Models for Data-Driven Drug Discovery
    Neves, Bruno J.
    Moreira-Filho, Jose T.
    Silva, Arthur C.
    Borba, Joyce V. V. B.
    Mottin, Melina
    Alves, Vinicius M.
    Braga, Rodolpho C.
    Muratov, Eugene N.
    Andrade, Carolina H.
    [J]. JOURNAL OF THE BRAZILIAN CHEMICAL SOCIETY, 2021, 32 (01) : 110 - 122
  • [59] In Silico Repositioning-Chemogenomics Strategy Identifies New Drugs with Potential Activity against Multiple Life Stages of Schistosoma mansoni
    Neves, Bruno J.
    Braga, Rodolpho C.
    Bezerra, Jose C. B.
    Cravo, Pedro V. L.
    Andrade, Carolina H.
    [J]. PLOS NEGLECTED TROPICAL DISEASES, 2015, 9 (01):
  • [60] NIH, 2018, NIH STRAT PLAN DAT S