Toward Automated Interpretation of LC-MS Data for Quality Assurance of a Screening Collection

Cited by: 1
Authors
Addison, Daniel H. [1]
Affiliations
[1] AstraZeneca, Screening Sci & Sample Management, Bldg 310,Cambridge Sci Pk,Milton Rd, Cambridge CB4 0FZ, England
Source
JALA | 2016 / Vol. 21 / Issue 06
Keywords
LC-MS; data mining; random forests; WEKA; Pipeline Pilot; decision tree classification; mass spectrometry; proteins
DOI
10.1177/2211068215620765
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
The AstraZeneca Compound Management group uses high-performance liquid chromatography-mass spectrometry for structure elucidation and purity determination of the AstraZeneca compound collection. These activities are conducted in a high-throughput environment where the rate-limiting step is the review and interpretation of analytical results, which is time-consuming and experience dependent. Despite the development of a semiautomated review system, manual interpretation of results remains a bottleneck. Data-mining techniques were applied to archived data to further automate the review process. Various classification models were evaluated using WEKA and Pipeline Pilot (Pipeline Pilot version 8.5.0.200, BIOVIA, San Diego, CA). Results were assessed using criteria including precision, recall, and receiver operating characteristic area. Each model was evaluated as a cost-insensitive classifier and again using MetaCost to apply cost sensitivity. Pruning and variable importance were also investigated. A 10-tree random forest generated with Pipeline Pilot reduced the number of analyses requiring manual review to 36.4% using a threshold of 90% confidence in predictions. This represents a 45% reduction in manual reviews compared with the previous system, delivering an annual savings of $45,000 or an increase in capacity from 25,000 analyses per month up to 45,000 with the same resource levels.
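The confidence-threshold triage described in the abstract can be illustrated with a minimal sketch. It assumes that "confidence" is the fraction of trees voting for the majority class, which may differ from how Pipeline Pilot computes it. The sample IDs, the pass/fail labels, and the function names are hypothetical. The last lines check the abstract's figures against each other, assuming that reviewer time is the only capacity limit and scales linearly with the number of manual reviews.

```python
from collections import Counter


def vote_confidence(tree_votes):
    """Return (majority_label, fraction of trees that voted for it)."""
    label, count = Counter(tree_votes).most_common(1)[0]
    return label, count / len(tree_votes)


def triage(analyses, threshold=0.90):
    """Split analyses into auto-accepted predictions and manual-review cases."""
    auto, manual = [], []
    for sample_id, votes in analyses.items():
        label, conf = vote_confidence(votes)
        target = auto if conf >= threshold else manual
        target.append((sample_id, label, conf))
    return auto, manual


# Hypothetical per-tree votes from a 10-tree forest (not data from the paper).
analyses = {
    "S1": ["pass"] * 10,
    "S2": ["pass"] * 9 + ["fail"],
    "S3": ["pass"] * 6 + ["fail"] * 4,
    "S4": ["fail"] * 10,
}
auto, manual = triage(analyses)
manual_fraction = len(manual) / len(analyses)

# Consistency check on the abstract's numbers (assumption: linear scaling).
new_manual = 0.364   # share of analyses still reviewed manually
reduction = 0.45     # relative reduction versus the previous system
implied_previous = new_manual / (1 - reduction)
implied_capacity = 25_000 * implied_previous / new_manual

print(f"manual review needed for {manual_fraction:.0%} of toy analyses")
print(f"implied previous manual share: {implied_previous:.1%}")
print(f"implied monthly capacity: {implied_capacity:,.0f}")
```

With a 10-tree forest, a 90% threshold under this assumption means at least 9 of the 10 trees must agree before a prediction is accepted without review. Under the linear-scaling assumption, the stated 45% reduction implies that roughly two-thirds of analyses were reviewed manually before, and the resulting capacity is close to the 45,000 analyses per month quoted in the abstract.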
Pages: 743-755
Page count: 13