Evaluation of crowdsourced mortality prediction models as a framework for assessing artificial intelligence in medicine

Cited by: 2
Authors
Bergquist, Timothy [1, 2]
Schaffter, Thomas [1]
Yan, Yao [1, 3]
Yu, Thomas [1]
Prosser, Justin [4]
Gao, Jifan [5]
Chen, Guanhua [5]
Lukasz, Charzewski [6, 7]
Nawalany, Zofia
Brugere, Ivan [8]
Retkute, Renata [9]
Prusokas, Alidivinas [10, 11]
Choi, Yonghwa [12]
Lee, Sanghoon [12]
Choe, Junseok [12]
Lee, Inggeol [13]
Kim, Sunkyu
Kang, Jaewoo
Mooney, Sean D.
Guinney, Justin [2]
Affiliations
[1] Sage Bionetworks, Seattle, WA 98121 USA
[2] Univ Washington, Dept Biomed Informat & Med Educ, Seattle, WA USA
[3] Univ Washington, Mol Engn & Sci Inst, Seattle, WA 98109 USA
[4] Univ Washington, Inst Translat Hlth Sci, Seattle, WA 98109 USA
[5] Univ Wisconsin, Dept Biostat & Med Informat, Madison, WI 98109 USA
[6] Proacta, Warsaw, Poland
[7] Univ Warsaw, Div Biophys, Warsaw, Poland
[8] Univ Illinois, Dept Comp Sci, Chicago, IL USA
[9] Univ Cambridge, Dept Plant Sci, Cambridge, England
[10] Newcastle Univ, Sch Nat & Environm Sci, Plant & Mol Sci, Newcastle Upon Tyne, England
[11] Imperial Coll London, Dept Life Sci, London, England
[12] Korea Univ, Coll Informat, Dept Comp Sci & Engn, Seoul, South Korea
[13] Korea Univ, Coll Informat, Dept Interdisciplinary Program Bioinformat, Seoul, South Korea
Funding
National Institutes of Health (USA);
Keywords
evaluation; machine learning; health informatics; PROTEIN; PHENOTYPES; HEALTH; CAFA;
DOI
10.1093/jamia/ocad159
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Applications of machine learning in healthcare are of high interest and have the potential to improve patient care. Yet, the real-world accuracy of these models in clinical practice and on different patient subpopulations remains unclear. To address these important questions, we hosted a community challenge to evaluate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as the community challenge question.
Materials and methods: Using a Model-to-Data framework, 345 registered participants, coalescing into 25 independent teams, spread over 3 continents and 10 countries, generated 25 accurate models all trained on a dataset of over 1.1 million patients and evaluated on patients prospectively collected over a 1-year observation of a large health system.
Results: The top performing team achieved a final area under the receiver operator curve of 0.947 (95% CI, 0.942-0.951) and an area under the precision-recall curve of 0.487 (95% CI, 0.458-0.499) on a prospectively collected patient cohort.
Discussion: Post hoc analysis after the challenge revealed that models differ in accuracy on subpopulations, delineated by race or gender, even when they are trained on the same data.
Conclusion: This is the largest community challenge focused on the evaluation of state-of-the-art machine learning methods in a healthcare system performed to date, revealing both opportunities and pitfalls of clinical AI.
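The abstract reports two metrics with 95% confidence intervals and notes that accuracy differs across subpopulations. As a minimal sketch of how such figures could be computed, and not the challenge's actual scoring code, the Python below implements the area under the receiver operator curve, the area under the precision-recall curve (as average precision), a percentile bootstrap interval, and a per-subgroup breakdown. The function names (`auroc`, `average_precision`, `bootstrap_ci`, `by_subgroup`), the choice of a percentile bootstrap, and the toy data are all assumptions made for illustration; the abstract does not say how its intervals were obtained.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; ties share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (ties keep input order)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision at each recalled positive
    return total / n_pos


def bootstrap_ci(metric, labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([labels[i] for i in idx], [scores[i] for i in idx]))
        except ValueError:
            continue  # resample was missing a class; skip it
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


def by_subgroup(metric, labels, scores, groups):
    """Evaluate one metric separately within each subgroup label."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = metric([labels[i] for i in idx], [scores[i] for i in idx])
    return result


if __name__ == "__main__":
    # Toy data only: 1 = died, 0 = survived; scores are predicted risks.
    y = [0, 1, 0, 1, 0, 1]
    p = [0.2, 0.9, 0.6, 0.4, 0.1, 0.7]
    g = ["a", "a", "b", "b", "a", "a"]
    print(auroc(y, p), average_precision(y, p))
    print(bootstrap_ci(auroc, y, p, n_boot=200))
    print(by_subgroup(auroc, y, p, g))
```

On the toy data the same scores rank subgroup "a" perfectly and subgroup "b" in reverse, which is the kind of gap a pooled metric can hide. With mortality as a rare outcome, average precision is also far more sensitive to prevalence than AUROC, which is consistent with the two reported values being so far apart.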
Pages: 35-44
Page count: 10