51
Turki T, Wei Z, Wang JTL. A transfer learning approach via procrustes analysis and mean shift for cancer drug sensitivity prediction. J Bioinform Comput Biol 2019; 16:1840014. [PMID: 29945499 DOI: 10.1142/s0219720018400140] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 12/15/2022]
Abstract
Transfer learning (TL) algorithms aim to improve the prediction performance in a target task (e.g. the prediction of cisplatin sensitivity in triple-negative breast cancer patients) via transferring knowledge from auxiliary data of a related task (e.g. the prediction of docetaxel sensitivity in breast cancer patients), where the distribution and even the feature space of the data pertaining to the tasks can be different. In real-world applications, we sometimes have a limited training set in a target task while we have auxiliary data from a related task. To obtain a better prediction performance in the target task, supervised learning requires a sufficiently large training set in the target task to perform well in predicting future test examples of the target task. In this paper, we propose a TL approach for cancer drug sensitivity prediction, where our approach combines three techniques. First, we shift the representation of a subset of examples from auxiliary data of a related task to a representation closer to a target training set of a target task. Second, we align the shifted representation of the selected examples of the auxiliary data to the target training set to obtain examples with representation aligned to the target training set. Third, we train machine learning algorithms using both the target training set and the aligned examples. We evaluate the performance of our approach against baseline approaches using the Area Under the receiver operating characteristic (ROC) Curve (AUC) on real clinical trial datasets pertaining to multiple myeloma, nonsmall cell lung cancer, triple-negative breast cancer, and breast cancer. Experimental results show that our approach is better than the baseline approaches in terms of performance and statistical significance.
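The alignment step described in the abstract can be illustrated with an orthogonal Procrustes fit. The following is a minimal sketch under strong assumptions, not the authors' implementation: it is restricted to 2-D points, uses rotation plus translation only, assumes the auxiliary and target examples are already paired row by row, and leaves out the mean shift step entirely. The function names `center` and `procrustes_align_2d` are hypothetical.

```python
import math

def center(points):
    """Subtract the centroid from every 2-D point; return (centered, centroid)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return [(x - cx, y - cy) for x, y in points], (cx, cy)

def procrustes_align_2d(aux, target):
    """Rotate and translate `aux` onto `target` (paired rows) in the least-squares sense."""
    aux_c, _ = center(aux)
    tgt_c, (tx, ty) = center(target)
    # Closed-form optimal rotation angle for the 2-D orthogonal Procrustes problem:
    # maximise sum_i b_i . (R a_i) over rotations R.
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(aux_c, tgt_c))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(aux_c, tgt_c))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the centered auxiliary points, then move them to the target centroid.
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in aux_c]

# Toy data (ASSUMPTION: illustrative values only, not from the paper).
aux = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
angle = 0.5
target = [(math.cos(angle) * x - math.sin(angle) * y + 5.0,
           math.sin(angle) * x + math.cos(angle) * y - 2.0) for x, y in aux]
aligned = procrustes_align_2d(aux, target)
```

In this toy case the target is an exact rotated and translated copy of the auxiliary points, so `aligned` coincides with `target` up to rounding. In higher dimensions the same problem is usually solved with a singular value decomposition rather than a single angle.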
Affiliation(s)
- Turki Turki
  - Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Jason T L Wang
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
52
Analytical performance evaluation and enhancement of the ADVIA Centaur® HIV Ag/Ab Combo assay. J Clin Virol 2019; 118:36-40. [PMID: 31415958 DOI: 10.1016/j.jcv.2019.07.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/14/2019] [Revised: 07/22/2019] [Accepted: 07/24/2019] [Indexed: 11/27/2022]
Abstract
BACKGROUND Fourth-generation immunoassays (such as the ADVIA Centaur® HIV Ag/Ab Combo (CHIV) assay) have improved the early diagnosis of human immunodeficiency virus (HIV), and their sensitivity and specificity usually exceed 99%. In regions with a low prevalence of HIV infection, however, the regular occurrence of false positives interferes with a medical laboratory's workflow. The additional reagent and staff costs associated with false positives can nevertheless be avoided or reduced by gaining a better knowledge of the CHIV assay's performance. OBJECTIVES/STUDY DESIGN To improve our HIV diagnosis strategy, we retrospectively analyzed all the Centaur® CHIV assays and confirmatory tests performed at Amiens University Medical Center between 2012 and 2018. We used open-source machine learning software to process this large database, develop a predictive model, and identify a new cut-off for Centaur® CHIV index interpretation. RESULTS A total of 56,682 HIV serological assay results were analyzed. The results of the CHIV assay were initially reactive or indeterminate for 449 samples. After p24 antigen and/or immunoblotting, there were 171 (38%) false positives and 278 (62%) confirmed true positives. The application of a cut-off of 2.12 led to reclassification of 130 of the 171 false positives as true negatives. Combining our predictive model with medical record analysis reduced the number of false positive CHIV assay results from 171 to 12. CONCLUSIONS The efficiency of the Centaur® CHIV assay can be increased by adjusting its cut-off for positivity. This adjustment may reduce the number of unnecessary confirmatory tests and accelerate the delivery of HIV test results.
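The counts reported in the RESULTS can be checked with a few lines of arithmetic. The sketch below uses only figures from the abstract. The positive predictive value after applying the cut-off rests on an assumption the abstract does not confirm, namely that no confirmed true positive falls below the 2.12 cut-off. Variable and function names are illustrative.

```python
# Counts taken from the abstract.
reactive = 449        # initially reactive or indeterminate CHIV results
false_pos = 171       # not confirmed by p24 antigen and/or immunoblotting
true_pos = 278        # confirmed true positives
reclassified = 130    # false positives reclassified by the 2.12 cut-off
fp_after_model = 12   # false positives left after model + medical record analysis

def ppv(tp, fp):
    """Positive predictive value: share of reactive results that are true positives."""
    return tp / (tp + fp)

fp_after_cutoff = false_pos - reclassified
ppv_before = ppv(true_pos, false_pos)
# ASSUMPTION: the cut-off removes no true positives (not stated in the abstract).
ppv_after_cutoff = ppv(true_pos, fp_after_cutoff)
# Relative reduction in false positives for the combined strategy.
fp_reduction_model = (false_pos - fp_after_model) / false_pos

print(f"false positives after cut-off: {fp_after_cutoff}")
print(f"PPV before: {ppv_before:.3f}, after cut-off (assumed): {ppv_after_cutoff:.3f}")
print(f"false-positive reduction with model + records: {fp_reduction_model:.1%}")
```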
53
Lind AP, Anderson PC. Predicting drug activity against cancer cells by random forest models based on minimal genomic information and chemical properties. PLoS One 2019; 14:e0219774. [PMID: 31295321 PMCID: PMC6622537 DOI: 10.1371/journal.pone.0219774] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 11/01/2018] [Accepted: 07/01/2019] [Indexed: 12/27/2022] Open
Abstract
A key goal of precision medicine is predicting the best drug therapy for a specific patient from genomic information. In oncology, cancers that appear similar pathologically can vary greatly in how they respond to the same drug. Fortunately, data from high-throughput screening programs often reveal important relationships between genomic variability of cancer cells and their response to drugs. Nevertheless, many current computational methods to predict compound activity against cancer cells require large quantities of genomic, epigenomic, and additional cellular data to develop and to apply. Here we integrate recent screening data and machine learning to train classification models that predict the activity/inactivity of compounds against cancer cells based on the mutational status of only 145 oncogenes and a set of compound structural descriptors. Using IC50 values of 1 μM as activity cutoffs, our predictive models have sensitivities of 87%, specificities of 87%, and yield an area under the receiver operating characteristic curve equal to 0.94. We also develop regression models to predict log(IC50) values of compounds for cancer cells; the models achieve a Pearson correlation coefficient of 0.86 for cross-validation and up to 0.65-0.73 against blind test sets. Predictive performance remains strong when as few as 50 oncogenes are included. Finally, even when 40% of experimental IC50 values are missing from screening data, they can be imputed with sufficient reliability that classification accuracy is not diminished. The presented models are fast to generate and may serve as easily implemented screening tools for personalized oncology medicine, drug repurposing, and drug discovery.
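To make the classification metrics concrete, the sketch below labels compounds as active or inactive with a 1 μM IC50 cut-off and computes sensitivity and specificity from the resulting labels. It is an illustration only: the IC50 values are invented toy numbers, treating "IC50 at or below the cut-off" as active is an assumption about the boundary case, and the predictions stand in for any classifier's output rather than the paper's random forest.

```python
def is_active(ic50_um, cutoff_um=1.0):
    """Label a compound as active when its IC50 (in micromolar) is at or below the cut-off."""
    return ic50_um <= cutoff_um

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for paired boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

# Toy IC50 values in micromolar (ASSUMPTION: made up for illustration).
measured = [0.2, 0.8, 1.5, 30.0, 0.05, 4.0]
predicted = [0.3, 1.2, 2.0, 10.0, 0.10, 0.9]

y_true = [is_active(v) for v in measured]
y_pred = [is_active(v) for v in predicted]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```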
Affiliation(s)
- Alex P. Lind
  - Physical Sciences Division, University of Washington Bothell, Bothell, Washington, United States of America
- Peter C. Anderson
  - Physical Sciences Division, University of Washington Bothell, Bothell, Washington, United States of America
54
Martinez-Ruiz A, Montañola-Sales C. Big data in multi-block data analysis: An approach to parallelizing Partial Least Squares Mode B algorithm. Heliyon 2019; 5:e01451. [PMID: 31183412 PMCID: PMC6495082 DOI: 10.1016/j.heliyon.2019.e01451] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/18/2018] [Revised: 01/21/2019] [Accepted: 03/26/2019] [Indexed: 11/23/2022] Open
Abstract
Partial Least Squares (PLS) Mode B is a multi-block method and a tightly coupled algorithm for estimating structural equation models (SEMs). Describing key aspects of parallel computing, we approach the parallelization of the PLS Mode B algorithm to operate on large distributed data. We show the scalability and performance of the algorithm at a very fine-grained level thanks to the versatility of pbdR, an R-project library for parallel computing. We vary several factors under different data distribution schemes in a supercomputing environment. Shorter elapsed times are obtained for the square blocking factor 16×16 using a grid of processors as square as possible, and for the non-square blocking factors 1000×4 and 10000×4 using a one-column grid of processors. Depending on the configuration, distributing data in a larger number of cores allows reaching speedups of up to 121 over the CPU implementation. Moreover, we show that SEMs can be estimated with big data sets using current state-of-the-art algorithms for multi-block data analysis.
Affiliation(s)
- Alba Martinez-Ruiz
  - Universidad Católica de la Santísima Concepción, Alonso de Ribera 2850, Concepción, Chile
- Cristina Montañola-Sales
  - IQS-Universitat Ramon Llull (URL), Via Augusta, 390, 08017 Barcelona, Spain
  - Barcelona Supercomputing Center, Centro Nacional de Supercomputación (BSC-CNS), Jordi Girona 29, 08034, Barcelona, Spain
55
Guan NN, Zhao Y, Wang CC, Li JQ, Chen X, Piao X. Anticancer Drug Response Prediction in Cell Lines Using Weighted Graph Regularized Matrix Factorization. Mol Ther Nucleic Acids 2019; 17:164-174. [PMID: 31265947 PMCID: PMC6610642 DOI: 10.1016/j.omtn.2019.05.017] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 02/25/2019] [Revised: 05/17/2019] [Accepted: 05/20/2019] [Indexed: 12/14/2022]
Abstract
Precision medicine has become a novel and rising concept, which depends much on the identification of individual genomic signatures for different patients. The cancer cell lines could reflect the “omic” diversity of primary tumors, based on which many works have been carried out to study the cancer biology and drug discovery both in experimental and computational aspects. In this work, we presented a novel method to utilize weighted graph regularized matrix factorization (WGRMF) for inferring anticancer drug response in cell lines. We constructed a p-nearest neighbor graph to sparsify drug similarity matrix and cell line similarity matrix, respectively. Using the sparsified matrices in the graph regularization terms, we performed matrix factorization to generate the latent matrices for drug and cell line. The graph regularization terms including neighbor information could help to exclude the noisy ingredient and improve the prediction accuracy. The 10-fold cross-validation was implemented, and the Pearson correlation coefficient (PCC), root-mean-square error (RMSE), PCCsr, and RMSEsr averaged over all drugs were calculated to evaluate the performance of WGRMF. The results on the Genomics of Drug Sensitivity in Cancer (GDSC) dataset are 0.64 ± 0.16, 1.37 ± 0.35, 0.73 ± 0.14, and 1.71 ± 0.44 for PCC, RMSE, PCCsr, and RMSEsr in turn. And for the Cancer Cell Line Encyclopedia (CCLE) dataset, WGRMF got results of 0.72 ± 0.09, 0.56 ± 0.19, 0.79 ± 0.07, and 0.69 ± 0.19, respectively. The results showed the superiority of WGRMF compared with previous methods. Besides, based on the prediction results using the GDSC dataset, three types of case studies were carried out. The results from both cross-validation and case studies have shown the effectiveness of WGRMF on the prediction of drug response in cell lines.
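The p-nearest neighbor sparsification mentioned in the abstract can be sketched as follows. This is one plausible reading, not the paper's exact procedure: an entry of the similarity matrix is kept when either item is among the other's p most similar neighbors, everything else (including the diagonal) is set to zero, and any additional edge weighting the authors apply is omitted. The function name `p_nearest_sparsify` and the toy matrix are hypothetical.

```python
def p_nearest_sparsify(sim, p):
    """Keep sim[i][j] only if j is among i's p nearest neighbours or vice versa."""
    n = len(sim)
    # For each item, the indices of its p most similar other items.
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        neighbours.append(set(others[:p]))
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # ASSUMPTION: an edge survives if either endpoint selects the other.
            if i != j and (j in neighbours[i] or i in neighbours[j]):
                sparse[i][j] = sim[i][j]
    return sparse

# Toy symmetric similarity matrix for four drugs (illustrative values only).
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
sparse = p_nearest_sparsify(sim, p=1)
```

With `p=1` only the pairs (0, 1) and (2, 3) survive here, so the sparsified matrix stays symmetric and can be used directly in a graph regularization term.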
Affiliation(s)
- Na-Na Guan
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yan Zhao
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Chun-Chun Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jian-Qiang Li
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xue Piao
  - School of Medical Informatics, Xuzhou Medical University, Xuzhou 221004, China
56
Xu X, Gu H, Wang Y, Wang J, Qin P. Autoencoder Based Feature Selection Method for Classification of Anticancer Drug Response. Front Genet 2019; 10:233. [PMID: 30972101 PMCID: PMC6445890 DOI: 10.3389/fgene.2019.00233] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 10/28/2018] [Accepted: 03/04/2019] [Indexed: 12/14/2022] Open
Abstract
Anticancer drug responses can be varied for individual patients. This difference is mainly caused by genetic reasons, like mutations and RNA expression. Thus, these genetic features are often used to construct classification models to predict the drug response. This research focuses on the feature selection issue for the classification models. Because of the vast dimensions of the feature space for predicting drug response, the autoencoder network was first built, and a subset of inputs with the important contribution was selected. Then by using the Boruta algorithm, a further small set of features was determined for the random forest, which was used to predict drug response. Two datasets, GDSC and CCLE, were used to illustrate the efficiency of the proposed method.
Affiliation(s)
- Xiaolu Xu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China
- Hong Gu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China
- Yang Wang
  - Institute of Cancer Stem Cell, Dalian Medical University, Dalian, China
- Jia Wang
  - Department of Breast Surgery, Institute of Breast Disease, Second Hospital of Dalian Medical University, Dalian, China
- Pan Qin
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China
57
Li Q, Shi R, Liang F. Drug sensitivity prediction with high-dimensional mixture regression. PLoS One 2019; 14:e0212108. [PMID: 30811440 PMCID: PMC6392252 DOI: 10.1371/journal.pone.0212108] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 05/12/2018] [Accepted: 01/27/2019] [Indexed: 11/28/2022] Open
Abstract
This paper proposes a mixture regression model-based method for drug sensitivity prediction. The proposed method explicitly addresses two fundamental issues in drug sensitivity prediction, namely, population heterogeneity and feature selection pertaining to each of the subpopulations. The mixture regression model is estimated using the imputation-conditional consistency algorithm, and the resulting estimator is consistent. This paper also proposes an average-BIC criterion for determining the number of components for the mixture regression model. The proposed method is applied to the CCLE dataset, and the numerical results indicate that the proposed method can make a drastic improvement over the existing ones, such as random forest, support vector regression, and regularized linear regression, in both drug sensitivity prediction and feature selection. The p-values for the comparisons in drug sensitivity prediction can reach the order O(10^-8) or lower for the drugs with heterogeneous populations.
Affiliation(s)
- Qianyun Li
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, United States of America
- Runmin Shi
  - Department of Statistics, University of Florida, Gainesville, FL 32611, United States of America
- Faming Liang
  - Department of Statistics, Purdue University, West Lafayette, IN 47906, United States of America
58
Rahman R, Dhruba SR, Ghosh S, Pal R. Functional random forest with applications in dose-response predictions. Sci Rep 2019; 9:1628. [PMID: 30733524 PMCID: PMC6367407 DOI: 10.1038/s41598-018-38231-w] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 05/29/2018] [Accepted: 12/20/2018] [Indexed: 12/18/2022] Open
Abstract
Drug sensitivity prediction for individual tumors is a significant challenge in personalized medicine. Current modeling approaches consider prediction of a single metric of the drug response curve such as AUC or IC50. However, the single summary metric of a dose-response curve fails to provide the entire drug sensitivity profile which can be used to design the optimal dose for a patient. In this article, we assess the problem of predicting the complete dose-response curve based on genetic characterizations. We propose an enhancement to the popular ensemble-based Random Forests approach that can directly predict the entire functional profile of a dose-response curve rather than a single summary metric. We design functional regression trees with node costs modified based on dose/response region dependence methodologies and response distribution based approaches. Our results on large pharmacological databases such as CCLE and GDSC show that the proposed functional framework predicts dose-response curves more accurately than univariate or multivariate Random Forests predicting sensitivities at different dose levels. Furthermore, we also considered the problem of predicting functional responses from functional predictors, i.e., estimating the dose-response curves with a model built on dose-dependent expression data. The superior performance of Functional Random Forest using functional data as compared to existing approaches has been shown using the HMS-LINCS dataset. In summary, Functional Random Forest presents an enhanced predictive modeling framework to predict the entire functional response profile considering both static and functional predictors instead of predicting the summary metrics of the response curves.
Affiliation(s)
- Raziur Rahman
  - Texas Tech University, Department of Electrical and Computer Engineering, Lubbock, Texas, 79409, USA
- Saugato Rahman Dhruba
  - Texas Tech University, Department of Electrical and Computer Engineering, Lubbock, Texas, 79409, USA
- Souparno Ghosh
  - Texas Tech University, Department of Mathematics and Statistics, Lubbock, Texas, 79409, USA
- Ranadip Pal
  - Texas Tech University, Department of Electrical and Computer Engineering, Lubbock, Texas, 79409, USA
59
Namkung J. Statistical Methods for Identifying Biomarkers from miRNA Profiles of Cancers. Methods Mol Biol 2019; 1882:261-286. [PMID: 30378062 DOI: 10.1007/978-1-4939-8879-2_24] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
Biomarkers play important roles in early diagnosis and treatment planning for cancer patients, and their importance is growing. With advances in high-throughput molecular profiling technology for various types of molecules such as DNA, RNA, proteins, or metabolites, it is now possible to perform massive profiling analysis that accelerates the discovery of novel biomolecules. No single marker is sufficiently accurate for clinical use, and thus cancer biomarkers are developed in the form of multiple biomarker panels. Of the various types of molecular biomarkers, microRNA (miRNA) has recently emerged as a promising class of cancer biomarker. MiRNAs are small noncoding RNAs that regulate gene expression. The chapter overviews the process of identifying biomarker panels from miRNA profiles, focusing on statistical methods. Molecular cancer biomarkers are introduced first. The process from sample design to miRNA profiling is reviewed in the method section. Statistical methods for biomarker development are introduced according to three typical purposes of molecular biomarkers: tumor subtype classification, early detection, and prediction of treatment response or prognosis of patients. Example R code is provided as well for selected methods.
60
Xia F, Shukla M, Brettin T, Garcia-Cardona C, Cohn J, Allen JE, Maslov S, Holbeck SL, Doroshow JH, Evrard YA, Stahlberg EA, Stevens RL. Predicting tumor cell line response to drug pairs with deep learning. BMC Bioinformatics 2018; 19:486. [PMID: 30577754 PMCID: PMC6302446 DOI: 10.1186/s12859-018-2509-3] [Citation(s) in RCA: 61] [Impact Index Per Article: 8.7] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity. RESULTS We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on residual neural networks for encoding features as well as predicting tumor growth, our model explains 94% of the response variance. While our best result is achieved with a combination of molecular feature types (gene expression, microRNA and proteome), we show that most of the predictive power comes from drug descriptors. To further demonstrate value in detecting anticancer therapy, we rank the drug pairs for each cell line based on model predicted combination effect and recover 80% of the top pairs with enhanced activity. CONCLUSIONS We present promising results in applying deep learning to predicting combinational drug response. Our feature analysis indicates screening data involving more cell lines are needed for the models to make better use of molecular features.
Affiliation(s)
- Fangfang Xia
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA
  - Computation Institute, The University of Chicago, Chicago, IL, USA
- Maulik Shukla
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA
- Thomas Brettin
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA
- Judith Cohn
  - Computer Science, Los Alamos National Laboratory, Los Alamos, NM, USA
- Jonathan E. Allen
  - Computation Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
- Sergei Maslov
  - Department of Bioengineering and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Susan L. Holbeck
  - Developmental Therapeutics Branch, National Cancer Institute, Frederick, MD, USA
- James H. Doroshow
  - Developmental Therapeutics Branch, National Cancer Institute, Frederick, MD, USA
- Yvonne A. Evrard
  - Developmental Therapeutics Branch, National Cancer Institute, Frederick, MD, USA
- Eric A. Stahlberg
  - Data Science and Information Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Rick L. Stevens
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA
  - Computation Institute, The University of Chicago, Chicago, IL, USA
61
Fang Y, Xu P, Yang J, Qin Y. A quantile regression forest based method to predict drug response and assess prediction reliability. PLoS One 2018; 13:e0205155. [PMID: 30289891 PMCID: PMC6173405 DOI: 10.1371/journal.pone.0205155] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 09/20/2017] [Accepted: 09/20/2018] [Indexed: 12/24/2022] Open
Abstract
Drug response prediction is a critical step for personalized treatment of cancer patients and ultimately leads to precision medicine. A lot of machine-learning based methods have been proposed to predict drug response from different types of genomic data. However, currently available methods could only give a "point" prediction of drug response value but fail to provide the reliability and distribution of the prediction, which are of equal interest in clinical practice. In this paper, we proposed a method based on quantile regression forest and applied it to the CCLE dataset. Through the out-of-bag validation, our method achieved much higher prediction accuracy of drug response than other available tools. The assessment of prediction reliability by prediction intervals and its significance in personalized medicine were illustrated by several examples. Functional analysis of selected drug response associated genes showed that the proposed method achieves more biologically plausible results.
Affiliation(s)
- Yun Fang
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Peirong Xu
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Jialiang Yang
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
- Yufang Qin
  - College of Information Technology, Shanghai Ocean University, Shanghai, China
62
Liu H, Zhao Y, Zhang L, Chen X. Anti-cancer Drug Response Prediction Using Neighbor-Based Collaborative Filtering with Global Effect Removal. Mol Ther Nucleic Acids 2018; 13:303-311. [PMID: 30321817 PMCID: PMC6197792 DOI: 10.1016/j.omtn.2018.09.011] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 06/22/2018] [Revised: 09/17/2018] [Accepted: 09/18/2018] [Indexed: 02/06/2023]
Abstract
Patients of the same cancer may differ in their responses to a specific medical therapy. Identification of predictive molecular features for drug sensitivity holds the key in the era of precision medicine. Human cell lines have harbored most of the same genetic changes found in patients’ tumors and thus are widely used in the research of drug response. In this work, we formulated drug-response prediction as a recommender system problem and then adopted a neighbor-based collaborative filtering with global effect removal (NCFGER) method to estimate anti-cancer drug responses of cell lines by integrating cell-line similarity networks and drug similarity networks based on the fact that similar cell lines and similar drugs exhibit similar responses. Specifically, we removed the global effect in the available responses and shrunk the similarity score for each cell line pair as well as each drug pair. We then used the K most similar neighbors (hybrid of cell-line-oriented and drug-oriented) in the available responses to predict the unknown ones. Through 10-fold cross-validation, this approach was shown to reach accurate and reproducible outcomes of drug sensitivity. We also discussed the biological outcomes based on the newly predicted response values.
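The idea of removing global effects and then predicting from the K most similar neighbors can be sketched as below. This is a simplified, cell-line-oriented illustration under stated assumptions, not the NCFGER algorithm itself: the global effect is modelled as an overall mean plus a per-cell-line and a per-drug offset, similarity shrinkage and the drug-oriented half of the hybrid are omitted, and all names (`baseline`, `predict`) and toy data are hypothetical.

```python
def baseline(R):
    """Estimate overall mean, per-row (cell line) and per-column (drug) offsets.

    R is a list of rows; unknown responses are None.
    """
    known = [(i, j, v) for i, row in enumerate(R) for j, v in enumerate(row) if v is not None]
    mu = sum(v for _, _, v in known) / len(known)
    row_bias = []
    for i in range(len(R)):
        vals = [v - mu for a, _, v in known if a == i]
        row_bias.append(sum(vals) / len(vals) if vals else 0.0)
    col_bias = []
    for j in range(len(R[0])):
        vals = [v - mu - row_bias[a] for a, b, v in known if b == j]
        col_bias.append(sum(vals) / len(vals) if vals else 0.0)
    return mu, row_bias, col_bias

def predict(R, sim, i, j, k=2):
    """Predict R[i][j] as baseline plus a similarity-weighted mean of neighbour residuals."""
    mu, row_bias, col_bias = baseline(R)

    def base(a, b):
        return mu + row_bias[a] + col_bias[b]

    # Residuals of other cell lines that have a known response to drug j.
    candidates = [(sim[i][a], R[a][j] - base(a, j))
                  for a in range(len(R)) if a != i and R[a][j] is not None]
    candidates.sort(key=lambda t: t[0], reverse=True)
    top = candidates[:k]
    weight = sum(abs(s) for s, _ in top)
    if weight == 0:
        return base(i, j)  # no usable neighbours: fall back to the baseline
    return base(i, j) + sum(s * r for s, r in top) / weight

# Toy response matrix (rows: cell lines, columns: drugs) and cell-line similarities.
R = [[2.0, 2.0, None],
     [2.0, 2.0, 2.0],
     [2.0, 2.0, 2.0]]
sim = [[1.0, 0.7, 0.4],
       [0.7, 1.0, 0.5],
       [0.4, 0.5, 1.0]]
estimate = predict(R, sim, 0, 2)
```

In this degenerate toy case every known response is identical, so all offsets and residuals are zero and the estimate equals the shared value.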
Affiliation(s)
- Hui Liu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Yan Zhao
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Lin Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
63
Zhang L, Chen X, Guan NN, Liu H, Li JQ. A Hybrid Interpolation Weighted Collaborative Filtering Method for Anti-cancer Drug Response Prediction. Front Pharmacol 2018; 9:1017. [PMID: 30258362 PMCID: PMC6143790 DOI: 10.3389/fphar.2018.01017] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 04/09/2018] [Accepted: 08/22/2018] [Indexed: 12/16/2022] Open
Abstract
Individualized therapies ask for the most effective regimen for each patient, while the patients' responses may differ from each other. However, it is impossible to clinically evaluate each patient's response due to the large population. Human cell lines have harbored most of the same genetic changes found in patients' tumors, and thus are widely used to help understand initial responses to drugs. Based on the more credible assumption that similar cell lines and similar drugs exhibit similar responses, we formulated drug response prediction as a recommender system problem, and then adopted a hybrid interpolation weighted collaborative filtering (HIWCF) method to predict anti-cancer drug responses of cell lines by incorporating cell line similarity and drug similarity shown from gene expression profiles, drug chemical structure as well as drug response similarity. Specifically, we estimated the baseline based on the available responses and shrunk the similarity score for each cell line pair as well as each drug pair. The similarity scores were then shrunk and weighted by the correlation coefficients drawn from the known responses between each pair. Before being used to find the K most similar neighbors for further prediction, they went through the case amplification strategy to emphasize high similarity and neglect low similarity. In the last step for prediction, cell line-oriented and drug-oriented collaborative filtering models were carried out, and the average of predicted values from both models was used as the final predicted sensitivity. Through 10-fold cross validation, this approach was shown to reach accurate and reproducible outcomes for those missing drug sensitivities. We also found that the drug response similarity between cell lines or drugs may play an important role in the prediction. Finally, we discussed the biological outcomes based on the newly predicted response values in the GDSC dataset.
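The abstract names similarity shrinkage, case amplification, and averaging of the two oriented models without giving formulas. The sketch below uses forms that are common in the collaborative filtering literature; whether HIWCF uses exactly these forms and parameter values is an assumption. The parameters `lam` and `rho` and all function names are illustrative.

```python
def shrink(similarity, n_common, lam=10.0):
    """Shrink a similarity toward zero when it rests on few shared observations.

    ASSUMPTION: the common n / (n + lambda) shrinkage factor.
    """
    return similarity * n_common / (n_common + lam)

def amplify(similarity, rho=2.5):
    """Case amplification: emphasise high similarities, damp low ones.

    ASSUMPTION: the common form s * |s|**(rho - 1), with rho >= 1.
    """
    return similarity * abs(similarity) ** (rho - 1)

def hybrid_prediction(cell_line_pred, drug_pred):
    """Average the cell line-oriented and drug-oriented predictions."""
    return (cell_line_pred + drug_pred) / 2

# Illustrative values only.
s = shrink(0.8, n_common=10, lam=10.0)       # halved because n equals lambda
strong, weak = amplify(0.9), amplify(0.2)    # amplification widens the gap
final = hybrid_prediction(1.2, 1.6)
```

Amplification keeps the sign of a similarity and leaves values of exactly 1 unchanged, so the ordering of neighbors is preserved while weak neighbors contribute far less to the weighted prediction.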
Affiliation(s)
- Lin Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Na-Na Guan
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Hui Liu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Jian-Qiang Li
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
64
Ali M, Aittokallio T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys Rev 2018; 11:31-39. [PMID: 30097794 PMCID: PMC6381361 DOI: 10.1007/s12551-018-0446-z] [Citation(s) in RCA: 110] [Impact Index Per Article: 15.7] [Received: 04/17/2018] [Accepted: 07/22/2018] [Indexed: 02/07/2023] Open
Abstract
In-depth modeling of the complex interplay among multiple omics data measured from cancer cell lines or patient tumors is providing new opportunities toward identification of tailored therapies for individual cancer patients. Supervised machine learning algorithms are increasingly being applied to the omics profiles as they enable integrative analyses among the high-dimensional data sets, as well as personalized predictions of therapy responses using multi-omics panels of response-predictive biomarkers identified through feature selection and cross-validation. However, technical variability and frequent missingness in input "big data" require the application of dedicated data preprocessing pipelines that often lead to some loss of information and compressed view of the biological signal. We describe here the state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction and give our perspective on further opportunities to make better use of high-dimensional multi-omics profiles along with knowledge about cancer pathways targeted by anti-cancer compounds when predicting their phenotypic responses.
Affiliation(s)
- Mehreen Ali
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, FI-00290, Helsinki, Finland; Helsinki Institute for Information Technology (HIIT), Aalto University, FI-02150, Espoo, Finland
| | - Tero Aittokallio
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, FI-00290, Helsinki, Finland; Helsinki Institute for Information Technology (HIIT), Aalto University, FI-02150, Espoo, Finland; Department of Mathematics and Statistics, University of Turku, FI-20014, Turku, Finland.
| |
|
65
|
Chang Y, Park H, Yang HJ, Lee S, Lee KY, Kim TS, Jung J, Shin JM. Cancer Drug Response Profile scan (CDRscan): A Deep Learning Model That Predicts Drug Effectiveness from Cancer Genomic Signature. Sci Rep 2018; 8:8857. [PMID: 29891981 PMCID: PMC5996063 DOI: 10.1038/s41598-018-27214-6] [Citation(s) in RCA: 145] [Impact Index Per Article: 20.7] [Received: 01/10/2018] [Accepted: 05/29/2018] [Indexed: 12/18/2022] Open
Abstract
In the era of precision medicine, cancer therapy can be tailored to an individual patient based on the genomic profile of a tumour. Despite the ever-increasing abundance of cancer genomic data, linking mutation profiles to drug efficacy remains a challenge. Herein, we report Cancer Drug Response profile scan (CDRscan), a novel deep learning model that predicts anticancer drug responsiveness based on large-scale drug screening assay data encompassing genomic profiles of 787 human cancer cell lines and structural profiles of 244 drugs. CDRscan employs a two-step convolution architecture, where the genomic mutational fingerprints of cell lines and the molecular fingerprints of drugs are processed individually, then merged by 'virtual docking', an in silico modelling of drug treatment. Analysis of the goodness-of-fit between observed and predicted drug response revealed a high prediction accuracy of CDRscan (R² > 0.84; AUROC > 0.98). We applied CDRscan to 1,487 approved drugs and identified 14 oncology and 23 non-oncology drugs having new potential cancer indications. This, to our knowledge, is the first application of a deep learning model in predicting the feasibility of drug repurposing. With further clinical validation, CDRscan is expected to allow selection of the most effective anticancer drugs for the genomic profile of the individual patient.
Affiliation(s)
- Yoosup Chang
- Yongin in silico Medical Research Centre, Syntekabio Inc., 283 Dongbaekjungang-ro, C508, Giheung-gu, Yongin, Gyeonggi-do, 17006, South Korea
| | - Hyejin Park
- Yongin in silico Medical Research Centre, Syntekabio Inc., 283 Dongbaekjungang-ro, C508, Giheung-gu, Yongin, Gyeonggi-do, 17006, South Korea
| | - Hyun-Jin Yang
- Gwanghwamun Medical Study Centre, Syntekabio Inc., 92 Saemunan-ro, #1708, Jongno-gu, Seoul, 03186, South Korea
| | - Seungju Lee
- Yongin in silico Medical Research Centre, Syntekabio Inc., 283 Dongbaekjungang-ro, C508, Giheung-gu, Yongin, Gyeonggi-do, 17006, South Korea
| | - Kwee-Yum Lee
- Gwanghwamun Medical Study Centre, Syntekabio Inc., 92 Saemunan-ro, #1708, Jongno-gu, Seoul, 03186, South Korea
- Faculty of Medicine, University of Queensland, Brisbane, QLD, 4072, Australia
| | - Tae Soon Kim
- Gwanghwamun Medical Study Centre, Syntekabio Inc., 92 Saemunan-ro, #1708, Jongno-gu, Seoul, 03186, South Korea
- Department of Clinical Medical Sciences, Seoul National University College of Medicine, 71 Ihwajang-gil, Jongno-gu, 03087, Seoul, South Korea
| | - Jongsun Jung
- Genome Data Integration Centre, Syntekabio Inc., 187 Techno 2-ro, B512, Yuseong-gu, Daejeon, 34025, South Korea.
| | - Jae-Min Shin
- Yongin in silico Medical Research Centre, Syntekabio Inc., 283 Dongbaekjungang-ro, C508, Giheung-gu, Yongin, Gyeonggi-do, 17006, South Korea.
| |
|
66
|
Abstract
BACKGROUND A significant problem in precision medicine is the prediction of drug sensitivity for individual cancer cell lines. Predictive models such as Random Forests have shown promising performance while predicting from individual genomic features such as gene expressions. However, accessibility of various other forms of data types including information on multiple tested drugs necessitates the examination of designing predictive models incorporating the various data types. RESULTS We explore the predictive performance of model stacking and the effect of stacking on the predictive bias and squared error. In addition, we discuss the analytical underpinnings supporting the advantages of stacking in reducing squared error and inherent bias of random forests in prediction of outliers. The framework is tested on a setup including gene expression, drug target, physical properties and drug response information for a set of drugs and cell lines. CONCLUSION The performance of individual and stacked models is compared. We note that stacking models built on two heterogeneous datasets provides superior performance to stacking different models built on the same dataset. It is also noted that stacking provides a noticeable reduction in the bias of our predictors when the dominant eigenvalue of the principal axis of variation in the residuals is significantly higher than the remaining eigenvalues.
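Model stacking as described in this abstract can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' framework: it assumes exactly two base models (for example, one per data type), takes their held-out predictions `p1` and `p2` as given, and combines them with least-squares weights and no intercept. The function names are hypothetical.

```python
def fit_stack_weights(p1, p2, y):
    """Least-squares weights (w1, w2) such that y is approximated by w1*p1 + w2*p2.

    p1, p2: held-out predictions of two base models (assumed inputs)
    y:      observed responses for the same samples
    Solves the 2x2 normal equations directly.
    """
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("base predictions are collinear; weights are not unique")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def stack_predict(weights, p1, p2):
    """Apply fitted stacking weights to two vectors of base predictions."""
    w1, w2 = weights
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]


def mean_squared_error(pred, y):
    """Average squared difference between predictions and observations."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
```

On the samples used to fit the weights, the stacked squared error cannot exceed that of either base model, because weights of (1, 0) or (0, 1) are among the candidates; whether that advantage carries over to new samples has to be checked on held-out data.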
Affiliation(s)
- Kevin Matlock
- Department of Electrical and Computer Engineering, Texas Tech University, 1012 Boston Ave, Lubbock, 79409 TX USA
| | - Carlos De Niz
- Department of Electrical and Computer Engineering, Texas Tech University, 1012 Boston Ave, Lubbock, 79409 TX USA
| | - Raziur Rahman
- Department of Electrical and Computer Engineering, Texas Tech University, 1012 Boston Ave, Lubbock, 79409 TX USA
| | - Souparno Ghosh
- Department of Mathematics and Statistics, Texas Tech University, 1108 Memorial Circle, Lubbock, 79409 TX USA
| | - Ranadip Pal
- Department of Electrical and Computer Engineering, Texas Tech University, 1012 Boston Ave, Lubbock, 79409 TX USA
| |
|
67
|
|
68
|
Dang CC, Peón A, Ballester PJ. Unearthing new genomic markers of drug response by improved measurement of discriminative power. BMC Med Genomics 2018; 11:10. [PMID: 29409485 PMCID: PMC5801688 DOI: 10.1186/s12920-018-0336-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/14/2016] [Accepted: 01/29/2018] [Indexed: 12/29/2022] Open
Abstract
Background Oncology drugs are only effective in a small proportion of cancer patients. Our current ability to identify these responsive patients before treatment is still poor in most cases. Thus, there is a pressing need to discover response markers for marketed and research oncology drugs. Screening these drugs against a large panel of cancer cell lines has led to the discovery of new genomic markers of in vitro drug response. However, while the identification of such markers among thousands of candidate drug-gene associations in the data is error-prone, an appraisal of the effectiveness of such a detection task is currently lacking. Methods Here we present a new non-parametric method for measuring the discriminative power of a drug-gene association. Unlike parametric statistical tests, the adopted non-parametric test has the advantage of not making strong assumptions about the data that would distort the identification of genomic markers. Furthermore, we introduce a new benchmark to further validate these markers in vitro using more recent data not used to identify the markers. Results The application of this new methodology has led to the identification of 128 new genomic markers distributed across 61% of the analysed drugs, including 5 drugs without previously known markers, which were missed by the MANOVA test initially applied to analyse data from the Genomics of Drug Sensitivity in Cancer consortium. Conclusions Discovering markers using more than one statistical test and testing them on independent data is unusual. We found this helpful to discard statistically significant drug-gene associations that were actually spurious correlations. This approach also revealed new, independently validated, in vitro markers of drug response such as Temsirolimus-CDKN2A (resistance) and Gemcitabine-EWS_FLI1 (sensitivity).
Electronic supplementary material The online version of this article (10.1186/s12920-018-0336-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Cuong C Dang
- Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France; Institut Paoli-Calmettes, F-13009, Marseille, France; Aix-Marseille Université, F-13284, Marseille, France; CNRS UMR7258, F-13009, Marseille, France
| | - Antonio Peón
- Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France; Institut Paoli-Calmettes, F-13009, Marseille, France; Aix-Marseille Université, F-13284, Marseille, France; CNRS UMR7258, F-13009, Marseille, France
| | - Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France; Institut Paoli-Calmettes, F-13009, Marseille, France; Aix-Marseille Université, F-13284, Marseille, France; CNRS UMR7258, F-13009, Marseille, France.
| |
|
69
|
Abstract
BACKGROUND Predicting the response to a drug for cancer disease patients based on genomic information is an important problem in modern clinical oncology. This problem occurs in part because many available drug sensitivity prediction algorithms do not consider better quality cancer cell lines and the adoption of new feature representations; both lead to the accurate prediction of drug responses. By predicting accurate drug responses to cancer, oncologists gain a more complete understanding of the effective treatments for each patient, which is a core goal in precision medicine. RESULTS In this paper, we model cancer drug sensitivity as a link prediction, which is shown to be an effective technique. We evaluate our proposed link prediction algorithms and compare them with an existing drug sensitivity prediction approach based on clinical trial data. The experimental results based on the clinical trial data show the stability of our link prediction algorithms, which yield the highest area under the ROC curve (AUC) and are statistically significant. CONCLUSIONS We propose a link prediction approach to obtain new feature representation. Compared with an existing approach, the results show that incorporating the new feature representation to the link prediction algorithms has significantly improved the performance.
Affiliation(s)
- Turki Turki
- Department of Computer Science, King Abdulaziz University, P.O. Box 80221, Jeddah, 21589, Saudi Arabia; Bioinformatics Program and Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA.
| | - Zhi Wei
- Bioinformatics Program and Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA.
| |
|
70
|
Naulaerts S, Dang CC, Ballester PJ. Precision and recall oncology: combining multiple gene mutations for improved identification of drug-sensitive tumours. Oncotarget 2017; 8:97025-97040. [PMID: 29228590 PMCID: PMC5722542 DOI: 10.18632/oncotarget.20923] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 03/16/2017] [Accepted: 08/14/2017] [Indexed: 02/07/2023] Open
Abstract
Cancer drug therapies are only effective in a small proportion of patients. To make things worse, our ability to identify these responsive patients before administering a treatment is generally very limited. The recent arrival of large-scale pharmacogenomic data sets, which measure the sensitivity of molecularly profiled cancer cell lines to a panel of drugs, has boosted research on the discovery of drug sensitivity markers. However, no systematic comparison of widely-used single-gene markers with multi-gene machine-learning markers exploiting genomic data has been so far conducted. We therefore assessed the performance offered by these two types of models in discriminating between sensitive and resistant cell lines to a given drug. This was carried out for each of 127 considered drugs using genomic data characterising the cell lines. We found that the proportion of cell lines predicted to be sensitive that are actually sensitive (precision) varies strongly with the drug and type of model used. Furthermore, the proportion of sensitive cell lines that are correctly predicted as sensitive (recall) of the best single-gene marker was lower than that of the multi-gene marker in 118 of the 127 tested drugs. We conclude that single-gene markers are only able to identify those drug-sensitive cell lines with the considered actionable mutation, unlike multi-gene markers that can in principle combine multiple gene mutations to identify additional sensitive cell lines. We also found that cell line sensitivities to some drugs (e.g. Temsirolimus, 17-AAG or Methotrexate) are better predicted by these machine-learning models.
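The two quantities defined in this abstract, precision (the proportion of cell lines predicted sensitive that are actually sensitive) and recall (the proportion of sensitive cell lines that are correctly predicted as sensitive), can be written out as a short Python sketch. The function name and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(predicted_sensitive, actually_sensitive):
    """Return (precision, recall) for one drug.

    predicted_sensitive: set of cell line ids the marker or model calls sensitive
    actually_sensitive:  set of cell line ids measured as sensitive
    An empty denominator yields 0.0 by convention (an assumption of this sketch).
    """
    true_positives = len(predicted_sensitive & actually_sensitive)
    precision = (true_positives / len(predicted_sensitive)
                 if predicted_sensitive else 0.0)
    recall = (true_positives / len(actually_sensitive)
              if actually_sensitive else 0.0)
    return precision, recall
```

A single-gene marker that flags only cell lines carrying one actionable mutation can reach high precision while its recall stays low, which is the pattern the abstract reports.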
Affiliation(s)
- Stefan Naulaerts
- Computational Biology and Drug Design, Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; CNRS UMR7258, Marseille, France
| | - Cuong C Dang
- Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
| | - Pedro J Ballester
- Computational Biology and Drug Design, Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; CNRS UMR7258, Marseille, France
| |
|
71
|
Turki T. Learning approaches to improve prediction of drug sensitivity in breast cancer patients. Annu Int Conf IEEE Eng Med Biol Soc 2017; 2016:3314-3320. [PMID: 28269014 DOI: 10.1109/embc.2016.7591437] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/05/2022]
Abstract
Predicting drug response to cancer disease is an important problem in modern clinical oncology that has attracted increasing attention from various domains such as computational biology, machine learning, and data mining. Cancer patients respond differently to each cancer therapy owing to disease diversity, genetic factors, and environmental causes. Thus, oncologists aim to identify the effective therapies for cancer patients and avoid adverse drug reactions in patients. By predicting the drug response to cancer, oncologists gain a fuller understanding of the effective treatments for each patient, which leads to better personalized treatment. In this paper, we present three learning approaches to improve the prediction of breast cancer patients' response to a chemotherapy drug: the instance selection approach, the oversampling approach, and the hybrid approach. We evaluate the performance of our approaches and compare them against the baseline approach using the Area Under the ROC Curve (AUC) on clinical trial data, in addition to testing the stability of the approaches. Our experimental results show the stability of our approaches, which give the highest AUC with statistical significance.
|
72
|
Ammad-ud-din M, Khan SA, Wennerberg K, Aittokallio T. Systematic identification of feature combinations for predicting drug response with Bayesian multi-view multi-task linear regression. Bioinformatics 2017; 33:i359-i368. [PMID: 28881998 PMCID: PMC5870540 DOI: 10.1093/bioinformatics/btx266] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION A prime challenge in precision cancer medicine is to identify genomic and molecular features that are predictive of drug treatment responses in cancer cells. Although there are several computational models for accurate drug response prediction, these often lack the ability to infer which feature combinations are the most predictive, particularly for high-dimensional molecular datasets. As increasing amounts of diverse genome-wide data sources are becoming available, there is a need to build new computational models that can effectively combine these data sources and identify maximally predictive feature combinations. RESULTS We present a novel approach that leverages systematic integration of data sources to identify response-predictive features of multiple drugs. To solve the modeling task we implement a Bayesian linear regression method. To further improve the usefulness of the proposed model, we exploit the known human cancer kinome for identifying biologically relevant feature combinations. In case studies with a synthetic dataset and two publicly available cancer cell line datasets, we demonstrate the improved accuracy of our method compared to the widely used approaches in drug response analysis. As key examples, our model identifies meaningful combinations of features for the well-known EGFR, ALK, PLK and PDGFR inhibitors. AVAILABILITY AND IMPLEMENTATION The source code of the method is available at https://github.com/suleimank/mvlr. CONTACT muhammad.ammad-ud-din@helsinki.fi or suleiman.khan@helsinki.fi. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Muhammad Ammad-ud-din
- Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
| | - Suleiman A Khan
- Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
| | - Krister Wennerberg
- Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
| | - Tero Aittokallio
- Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
- Department of Mathematics and Statistics, University of Turku, Turku, Finland
| |
|
73
|
Peón A, Naulaerts S, Ballester PJ. Predicting the Reliability of Drug-target Interaction Predictions with Maximum Coverage of Target Space. Sci Rep 2017; 7:3820. [PMID: 28630414 PMCID: PMC5476590 DOI: 10.1038/s41598-017-04264-w] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 02/09/2017] [Accepted: 05/26/2017] [Indexed: 02/05/2023] Open
Abstract
Many computational methods to predict the macromolecular targets of small organic molecules have been presented to date. Despite progress, target prediction methods still have important limitations. For example, the most accurate methods implicitly restrict their predictions to a relatively small number of targets, are not systematically validated on drugs (whose targets are harder to predict than those of non-drug molecules) and often lack a reliability score associated with each predicted target. Here we present a systematic validation of ligand-centric target prediction methods on a set of clinical drugs. These methods exploit a knowledge-base covering 887,435 known ligand-target associations between 504,755 molecules and 4,167 targets. Based on this dataset, we provide a new estimate of the polypharmacology of drugs, which on average have 11.5 targets below IC50 10 µM. The average performance achieved across clinical drugs is remarkable (0.348 precision and 0.423 recall, with large drug-dependent variability), especially given the unusually large coverage of the target space. Furthermore, we show how a sparse ligand-target bioactivity matrix to retrospectively validate target prediction methods could underestimate prospective performance. Lastly, we present and validate a first-in-kind score capable of accurately predicting the reliability of target predictions.
Affiliation(s)
- Antonio Peón
- Centre de Recherche en Cancérologie de Marseille (CRCM), Inserm, U1068, Marseille, F-13009, France
- CNRS, UMR7258, Marseille, F-13009, France
- Institut Paoli-Calmettes, Marseille, F-13009, France
- Aix-Marseille University, UM 105, F-13284, Marseille, France
| | - Stefan Naulaerts
- Centre de Recherche en Cancérologie de Marseille (CRCM), Inserm, U1068, Marseille, F-13009, France
- CNRS, UMR7258, Marseille, F-13009, France
- Institut Paoli-Calmettes, Marseille, F-13009, France
- Aix-Marseille University, UM 105, F-13284, Marseille, France
| | - Pedro J Ballester
- Centre de Recherche en Cancérologie de Marseille (CRCM), Inserm, U1068, Marseille, F-13009, France.
- CNRS, UMR7258, Marseille, F-13009, France.
- Institut Paoli-Calmettes, Marseille, F-13009, France.
- Aix-Marseille University, UM 105, F-13284, Marseille, France.
| |
|
74
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 PMCID: PMC5310525 DOI: 10.12688/f1000research.10529.2] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Accepted: 03/10/2017] [Indexed: 12/19/2022] Open
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data.
Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation.
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
Affiliation(s)
- Linh Nguyen
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Cuong C Dang
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| |
|
75
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 DOI: 10.12688/f1000research.10529.1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Accepted: 12/28/2016] [Indexed: 12/30/2022] Open
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
Affiliation(s)
- Linh Nguyen
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Cuong C Dang
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| |
|
76
|
|
77
|
Gurard-Levin ZA, Wilson LOW, Pancaldi V, Postel-Vinay S, Sousa FG, Reyes C, Marangoni E, Gentien D, Valencia A, Pommier Y, Cottu P, Almouzni G. Chromatin Regulators as a Guide for Cancer Treatment Choice. Mol Cancer Ther 2016; 15:1768-77. [PMID: 27196757 DOI: 10.1158/1535-7163.mct-15-1008] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 12/24/2015] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
The limited capacity to predict a patient's response to distinct chemotherapeutic agents is a major hurdle in cancer management. The efficiency of a large fraction of current cancer therapeutics (radio- and chemotherapies) is influenced by chromatin structure. Reciprocally, alterations in chromatin organization may affect resistance mechanisms. Here, we explore how the misexpression of chromatin regulators (factors involved in the establishment and maintenance of functional chromatin domains) can inform about the extent of docetaxel response. We exploit Affymetrix and NanoString gene expression data for a set of chromatin regulators generated from breast cancer patient-derived xenograft models and patient samples treated with docetaxel. Random Forest classification reveals specific panels of chromatin regulators, including key components of the SWI/SNF chromatin remodeler, which readily distinguish docetaxel high-responders and poor-responders. Further exploration of SWI/SNF components in the comprehensive NCI-60 dataset reveals that the expression inversely correlates with docetaxel sensitivity. Finally, we show that loss of the SWI/SNF subunit BRG1 (SMARCA4) in a model cell line leads to enhanced docetaxel sensitivity. Altogether, our findings point toward chromatin regulators as biomarkers for drug response as well as therapeutic targets to sensitize patients toward docetaxel and combat drug resistance. Mol Cancer Ther; 15(7); 1768-77. ©2016 AACR.
Affiliation(s)
- Zachary A Gurard-Levin
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France.
| | - Laurence O W Wilson
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France
| | - Vera Pancaldi
- Spanish National Cancer Research Centre (CNIO), c/Melchor Fernandez, Almagro, Madrid, Spain
| | - Sophie Postel-Vinay
- DITEP (Département d'Innovations Thérapeutiques et Essais Précoces), Gustave Roussy, France. Inserm Unit U981, Gustave Roussy, Villejuif, France. Université Paris Saclay, Université Paris-Sud, Faculté de Médicine, Le Kremlin Bicêtre, France
| | - Fabricio G Sousa
- Developmental Therapeutics Branch and Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland
| | - Cecile Reyes
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
| | - Elisabetta Marangoni
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
| | - David Gentien
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
| | - Alfonso Valencia
- Spanish National Cancer Research Centre (CNIO), c/Melchor Fernandez, Almagro, Madrid, Spain
| | - Yves Pommier
- Developmental Therapeutics Branch and Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland
| | - Paul Cottu
- Institut Curie, Medical Oncology, Paris, France
| | - Geneviève Almouzni
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France.
| |
Collapse
|
78
|
Rahman R, Haider S, Ghosh S, Pal R. Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction. Cancer Inform 2016; 14:57-73. [PMID: 27081304 PMCID: PMC4820080 DOI: 10.4137/cin.s30794] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 12/10/2015] [Revised: 02/03/2016] [Accepted: 02/07/2016] [Indexed: 01/07/2023] Open
Abstract
Random forests, which consist of an ensemble of equally weighted regression trees, are frequently used to design predictive models. In this article, we consider an extension of the methodology that represents the regression trees as probabilistic trees and analyzes the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide tighter CIs with comparable performance in mean error. We approached the prediction of the ensemble of probabilistic trees from two perspectives: as a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia dataset and showed that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
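The weighted-sum view mentioned in this abstract can be illustrated with a small sketch. It is not the authors' method: it assumes the covariance between tree predictions is already known, uses a normal approximation for the interval, and ignores any effect on mean error. The covariance values and the reweighted vector are hypothetical.

```python
import math

def weighted_sum_variance(weights, cov):
    """Variance of sum_i w_i * T_i given the covariance matrix of the T_i."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

def normal_ci_length(variance, z=1.96):
    """Length of a two-sided normal-approximation interval (z=1.96 for ~95%)."""
    return 2.0 * z * math.sqrt(variance)

# Hypothetical covariance of three tree predictions; the third tree is noisier.
cov = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 4.0],
]
equal = [1 / 3, 1 / 3, 1 / 3]
reweighted = [0.45, 0.45, 0.10]  # illustrative weights, not optimized

var_equal = weighted_sum_variance(equal, cov)
var_reweighted = weighted_sum_variance(reweighted, cov)
print(normal_ci_length(var_equal), normal_ci_length(var_reweighted))
```

In this toy case, shifting weight away from the noisy tree lowers the variance of the ensemble prediction and therefore shortens the interval, which is the kind of effect the abstract describes.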
Affiliation(s)
- Raziur Rahman
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA
- Saad Haider
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA
- Souparno Ghosh
  - Department of Mathematics and Statistics, Texas Tech University, Lubbock, TX, USA
- Ranadip Pal
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA
79
Refining Time-Activity Classification of Human Subjects Using the Global Positioning System. PLoS One 2016; 11:e0148875. [PMID: 26919723 PMCID: PMC4769278 DOI: 10.1371/journal.pone.0148875] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 05/19/2015] [Accepted: 01/24/2016] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND: Detailed spatial location information is important for accurately estimating personal exposure to air pollution. The Global Positioning System (GPS) has been widely used to track personal paths and activities. Previous researchers have developed time-activity classification models based on GPS data, but most were developed for specific regions. An adaptive model for time-location classification can be widely applied to air pollution studies that use GPS to track individual-level time-activity patterns.
METHODS: Time-activity data were collected for seven days using GPS loggers and accelerometers from thirteen adult participants from Southern California under free-living conditions. We developed an automated model based on random forests to classify major time-activity patterns (i.e., indoor, outdoor-static, outdoor-walking, and in-vehicle travel). Sensitivity analysis was conducted to examine the contribution of the accelerometer data and the supplemental spatial data (i.e., roadway and tax parcel data) to the accuracy of time-activity classification. Our model was evaluated using both leave-one-fold-out and leave-one-subject-out methods.
RESULTS: Maximum speeds in averaging time intervals of 7 and 5 minutes, and distance to primary highways with limited access, were found to be the three most important variables in the classification model. Leave-one-fold-out cross-validation showed an overall accuracy of 99.71%. Sensitivities varied from 84.62% (outdoor walking) to 99.90% (indoor). Specificities varied from 96.33% (indoor) to 99.98% (outdoor static). The exclusion of accelerometer and ambient light sensor variables caused a slight loss in sensitivity for outdoor walking, but little loss in overall accuracy. However, leave-one-subject-out cross-validation showed considerable loss in sensitivity for outdoor static and outdoor walking conditions.
CONCLUSIONS: The random forests classification model can achieve high accuracy for the four major time-activity categories. The model also performed well with just GPS, road and tax parcel data. However, caution is warranted when generalizing the model developed from a small number of subjects to other populations.
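The leave-one-subject-out evaluation named in the methods can be sketched as follows. This is a generic illustration, not the study's code: the subject labels are hypothetical, and the function only produces index splits, leaving model fitting aside.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical subject labels for six GPS records from three participants.
subjects = ["A", "A", "B", "B", "B", "C"]
splits = list(leave_one_subject_out(subjects))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Because every record from the held-out participant is kept out of training, this scheme tests generalization to unseen people. If folds are instead drawn over individual records, the same participant can appear on both sides of a split, which may partly explain the gap between the two accuracy estimates reported above.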
80
Amadoz A, Sebastian-Leon P, Vidal E, Salavert F, Dopazo J. Using activation status of signaling pathways as mechanism-based biomarkers to predict drug sensitivity. Sci Rep 2015; 5:18494. [PMID: 26678097 PMCID: PMC4683444 DOI: 10.1038/srep18494] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 09/09/2015] [Accepted: 11/19/2015] [Indexed: 12/22/2022] Open
Abstract
Many complex traits, such as drug response, are associated with changes in biological pathways rather than being caused by single gene alterations. Here, a predictive framework is presented in which gene expression data are recoded into activity statuses of signal transduction circuits (sub-pathways within signaling pathways that connect receptor proteins to final effector proteins that trigger cell actions). Such activity values are used as features by a prediction algorithm that can efficiently predict a continuous variable such as the IC50 value. The main advantage of this prediction method is that the features selected by the predictor, the signaling circuits, are themselves information-rich, mechanism-based biomarkers that provide insight into drug molecular mechanisms of action (MoA).
Affiliation(s)
- Alicia Amadoz
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Patricia Sebastian-Leon
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Enrique Vidal
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
- Francisco Salavert
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
- Joaquin Dopazo
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
  - Functional Genomics Node, (INB) at CIPF, Valencia, Spain
81
Haider S, Rahman R, Ghosh S, Pal R. A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction. PLoS One 2015; 10:e0144490. [PMID: 26658256 PMCID: PMC4684346 DOI: 10.1371/journal.pone.0144490] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/13/2015] [Accepted: 11/19/2015] [Indexed: 01/01/2023] Open
Abstract
Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble-based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team science-based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationship between different drug sensitivities during model generation. This application motivates the need for multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference between the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions, and that the copula-based multivariate random forest framework can provide higher-accuracy prediction and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity in Cancer and Cancer Cell Line Encyclopedia databases.
Affiliation(s)
- Saad Haider
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America
- Raziur Rahman
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America
- Souparno Ghosh
  - Department of Mathematics and Statistics, Texas Tech University, Lubbock, Texas, United States of America
- Ranadip Pal
  - Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America
82
Jung S, Bi Y, Davuluri RV. Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping. BMC Genomics 2015; 16 Suppl 11:S3. [PMID: 26576613 PMCID: PMC4652565 DOI: 10.1186/1471-2164-16-s11-s3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: Many supervised learning algorithms have been applied to derive gene signatures for patient stratification from gene expression data. However, transferring multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (equal-width binning, equal-frequency binning, and k-means clustering) for accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on isoform-level gene expression profiles from the exon-array platform and tested on the corresponding profiles from RNA-seq data.
RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, adding the data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, adding the data discretization step dramatically improved classification accuracy, with equal-frequency binning showing the largest improvement and more than 90% accuracy for all models with features chosen by Random Forest-based feature selection. Overall, the SVM classifier coupled with equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, accuracy reached at most 73.6%.
CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
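Two of the discretization methods compared in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's pipeline: binning is applied to a single feature across samples, and the two value lists are hypothetical stand-ins for the same isoform measured on two platforms, with one scale a monotone distortion of the other.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical measurements of one isoform in six samples on two platforms.
array_values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
rnaseq_values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

print(equal_width_bins(array_values, 3), equal_width_bins(rnaseq_values, 3))
print(equal_frequency_bins(array_values, 3), equal_frequency_bins(rnaseq_values, 3))
```

Equal-frequency binning depends only on the rank order of the values, so the two hypothetical platforms receive identical bin labels, while equal-width binning is sensitive to the scale. This is one plausible intuition for the cross-platform result, offered here as an assumption rather than a claim made in the abstract.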
Affiliation(s)
- Segun Jung
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
- Yingtao Bi
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
- Ramana V Davuluri
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
83
Qureshi A, Tandon H, Kumar M. AVP-IC50 Pred: Multiple machine learning techniques-based prediction of peptide antiviral activity in terms of half maximal inhibitory concentration (IC50). Biopolymers 2015; 104:753-63. [PMID: 26213387 PMCID: PMC7161829 DOI: 10.1002/bip.22703] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 03/22/2015] [Revised: 06/16/2015] [Accepted: 07/21/2015] [Indexed: 01/29/2023]
Abstract
Peptide-based antiviral therapeutics have gradually made their way into mainstream drug discovery research. Experimental determination of peptides' antiviral activity, expressed as IC50 values, involves considerable effort. Therefore, we have developed "AVP-IC50 Pred," a regression-based algorithm to predict antiviral activity in terms of IC50 values (μM). A total of 759 non-redundant peptides from AVPdb and HIPdb were divided into a training/test set of 683 peptides (T(683)) and a validation set of 76 independent peptides (V(76)) for evaluation. We utilized important peptide sequence features such as amino-acid compositions, binary profiles of N8-C8 residues, physicochemical properties and their hybrids. Four different machine learning techniques (MLTs), namely Support vector machine, Random Forest, Instance-based classifier, and K-Star, were employed. During 10-fold cross-validation, we achieved maximum Pearson correlation coefficients (PCCs) of 0.66, 0.64, 0.56, and 0.55, respectively, for the above MLTs using the best combination of feature sets. All the predictive models also performed well on the independent validation dataset and achieved maximum PCCs of 0.74, 0.68, 0.59, and 0.57, respectively, on the best combination of feature sets. The AVP-IC50 Pred web server is anticipated to assist researchers working on antiviral therapeutics by enabling them to computationally screen many compounds and focus experimental validation on the most promising set of peptides, thus reducing cost and time. The server is available at http://crdd.osdd.net/servers/ic50avp.
Affiliation(s)
- Abid Qureshi
  - Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Sector 39-A, Chandigarh 160036, India
- Himani Tandon
  - Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Sector 39-A, Chandigarh 160036, India
- Manoj Kumar
  - Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Sector 39-A, Chandigarh 160036, India
84
Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing opportunities through compound-target interaction networks. Expert Opin Drug Discov 2015; 10:1333-45. [PMID: 26429153 DOI: 10.1517/17460441.2015.1096926] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 12/25/2022]
Abstract
INTRODUCTION: System-wide identification of both on- and off-targets of chemical probes provides improved understanding of their therapeutic potential and possible adverse effects, thereby accelerating and de-risking the drug discovery process. Given the high costs of experimental profiling of the complete target space of drug-like compounds, computational models offer a systematic means of guiding these mapping efforts. These models suggest the most potent interactions for further experimental or pre-clinical evaluation both in cell line models and in patient-derived material.
AREAS COVERED: The authors focus here on network-based machine learning models and their use in the prediction of novel compound-target interactions in both target-based and phenotype-based drug discovery applications. While currently used mainly to complement experimentally mapped compound-target networks in drug repurposing applications, such as extending the target space of already approved drugs, these network pharmacology approaches may also suggest completely unexpected and novel investigational probes for drug development.
EXPERT OPINION: Although the studies reviewed here have already demonstrated that network-centric modeling approaches have the potential to identify candidate compounds and selective targets in disease networks, many challenges remain. In particular, these challenges include how to incorporate the cellular context and genetic background into the disease networks to enable more stratified and selective target predictions, as well as how to make the prediction models more realistic for practical drug discovery and therapeutic applications.
Affiliation(s)
- Anna Cichonska
  - University of Helsinki, Institute for Molecular Medicine Finland FIMM, Helsinki, Finland
  - Aalto University, Helsinki Institute for Information Technology HIIT, Department of Computer Science, Espoo, Finland
- Juho Rousu
  - Aalto University, Helsinki Institute for Information Technology HIIT, Department of Computer Science, Espoo, Finland
- Tero Aittokallio
  - University of Helsinki, Institute for Molecular Medicine Finland FIMM, Helsinki, Finland
  - University of Turku, Department of Mathematics and Statistics, Turku, Finland
85
MeSiC: A Model-Based Method for Estimating 5mC Levels at Single-CpG Resolution from MeDIP-seq. Sci Rep 2015; 5:14699. [PMID: 26424089 PMCID: PMC4589794 DOI: 10.1038/srep14699] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 03/20/2015] [Accepted: 09/07/2015] [Indexed: 01/03/2023] Open
Abstract
As the fifth base in the mammalian genome, 5-methylcytosine (5mC) is essential for many biological processes, including normal development and disease. Methylated DNA immunoprecipitation sequencing (MeDIP-seq), which uses anti-5mC antibodies to enrich for the methylated fraction of the genome, is widely used to investigate the methylome at a resolution of 100–500 bp. Considering the CpG density-dependent bias and limited resolution of MeDIP-seq, we developed a Random Forest Regression (RFR) method, MeSiC, to estimate DNA methylation levels at single-base resolution. MeSiC integrates MeDIP-seq signals of CpG sites and their surrounding neighbors, as well as genomic features, to construct genomic element-dependent RFR models. In the H1 cell line, a high correlation was observed between MeSiC predictions and actual 5mC levels. MeSiC also made it possible to calibrate the CpG density-dependent bias of MeDIP-seq signals. Importantly, we found that MeSiC models constructed in the H1 cell line could be used to accurately predict DNA methylation levels for other cell types. Comparisons with methylCRF and MEDIPS showed that MeSiC achieved comparable or better performance. These results demonstrate that MeSiC can provide accurate estimates of 5mC levels at single-CpG resolution using MeDIP-seq data alone.
86
Predicting Anticancer Drug Responses Using a Dual-Layer Integrated Cell Line-Drug Network Model. PLoS Comput Biol 2015; 11:e1004498. [PMID: 26418249 PMCID: PMC4587957 DOI: 10.1371/journal.pcbi.1004498] [Citation(s) in RCA: 123] [Impact Index Per Article: 12.3] [Received: 12/04/2014] [Accepted: 08/10/2015] [Indexed: 01/22/2023] Open
Abstract
The ability to predict the response of a cancer patient to a therapeutic agent is a major goal in modern oncology that should ultimately lead to personalized treatment. Existing approaches to predicting drug sensitivity rely primarily on profiling cancer cell line panels that have been treated with different drugs and selecting genomic or functional genomic features to regress or classify the drug response. Here, we propose a dual-layer integrated cell line-drug network model, which uses both cell line similarity network (CSN) data and drug similarity network (DSN) data to predict the drug response of a given cell line using a weighted model. Using the Cancer Cell Line Encyclopedia (CCLE) and Cancer Genome Project (CGP) studies as benchmark datasets, our single-layer model with CSN or DSN and only a single parameter achieved a prediction performance comparable to the previously generated elastic net model. When using the dual-layer model integrating both CSN and DSN, our predicted response reached a 0.6 Pearson correlation coefficient with observed responses for most drugs, which is significantly better than the previous results using the elastic net model. We have also applied the dual-layer cell line-drug integrated network model to fill in the missing drug response values in the CGP dataset. Even though the dual-layer integrated cell line-drug network model does not specifically model mutation information, it correctly predicted that BRAF mutant cell lines would be more sensitive than BRAF wild-type cell lines to the three MEK1/2 inhibitors tested.
In this study, using the Cancer Cell Line Encyclopedia (CCLE) and Cancer Genome Project (CGP) studies as benchmark datasets, we explored the application of similarity information between cell lines and drugs in drug response prediction. We found that cell lines with similar gene expression profiles exhibit similar responses to the same drug. Meanwhile, drugs with similar chemical structures also show similar inhibitory effects across different cell lines. Based on these observations, we proposed a dual-layer network and local weighted model to predict the drug response of a cell line using proximal information from the drug-cell line network. The model's only three parameters are optimized by leave-one-out cross-validation for each drug. Two case studies of the MAPK and ERK signaling pathways on the CCLE dataset showed that the predicted-to-observed correlations of our dual-layer network model are significantly better than those of the previous elastic net predictor. Interestingly, predictions based on the drug similarity network (DSN) alone were much better than those based on the cell line similarity network (CSN) alone for most drugs, implying that drug similarities are more informative for drug response prediction than cell line similarities. Our network model can be applied to predict the response of a new cell line to already tested drugs or the response of an existing cell line to new drugs, thus potentially reducing the cost of drug-cell line screening.
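The idea of predicting a response from similar cell lines and similar drugs can be sketched as below. This is a hedged illustration only: the Gaussian kernel on `1 - similarity`, the parameter names `decay_cell`, `decay_drug` and `lam`, and all numbers are assumptions made for the example, and the abstract does not specify the form of the published model.

```python
import math

def kernel_weighted_mean(similarities, responses, decay):
    """Average responses, weighting each neighbor by a Gaussian kernel on (1 - similarity)."""
    weights = [math.exp(-((1.0 - s) ** 2) / (2.0 * decay ** 2)) for s in similarities]
    return sum(w * y for w, y in zip(weights, responses)) / sum(weights)

def dual_layer_prediction(cell_sims, cell_resps, drug_sims, drug_resps,
                          decay_cell=0.5, decay_drug=0.5, lam=0.5):
    """Blend a cell-line-layer estimate with a drug-layer estimate (hypothetical form)."""
    from_cells = kernel_weighted_mean(cell_sims, cell_resps, decay_cell)
    from_drugs = kernel_weighted_mean(drug_sims, drug_resps, decay_drug)
    return lam * from_cells + (1.0 - lam) * from_drugs

# Hypothetical data for one (cell line, drug) pair.
cell_sims = [0.9, 0.8, 0.1]    # similarity of the target cell line to three others
cell_resps = [1.0, 1.2, 5.0]   # their observed responses to the same drug
drug_sims = [0.95, 0.2]        # similarity of the target drug to two others
drug_resps = [1.5, 4.0]        # the same cell line's responses to those drugs

prediction = dual_layer_prediction(cell_sims, cell_resps, drug_sims, drug_resps)
print(round(prediction, 3))
```

In a setup like this, the three illustrative parameters could be tuned per drug by leaving each known response out in turn and predicting it from the rest, which matches the leave-one-out validation the abstract mentions.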
87
Cortés-Ciriano I, van Westen GJP, Bouvier G, Nilges M, Overington JP, Bender A, Malliavin TE. Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics 2015; 32:85-95. [PMID: 26351271 PMCID: PMC4681992 DOI: 10.1093/bioinformatics/btv529] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 02/27/2015] [Accepted: 08/26/2015] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION: Recent large-scale omics initiatives have catalogued the somatic alterations of cancer cell line panels along with their pharmacological response to hundreds of compounds. In this study, we have explored these data to advance computational approaches that enable more effective and targeted use of current and future anticancer therapeutics.
RESULTS: We modelled the 50% growth inhibition bioassay end-point (GI50) of 17,142 compounds screened against 59 cancer cell lines from the NCI60 panel (941,831 data-points, matrix 93.08% complete) by integrating the chemical and biological (cell line) information. We determine that protein, gene transcript and miRNA abundance provide the highest predictive signal when modelling the GI50 endpoint, significantly outperforming DNA copy-number variation or exome sequencing data (Tukey's Honestly Significant Difference, P < 0.05). We demonstrate that, within the limits of the data, our approach can both interpolate and extrapolate compound bioactivities to new cell lines and tissues and, although to a lesser extent, to dissimilar compounds. Moreover, our approach outperforms previous models generated on the GDSC dataset. Finally, we determine that in the cases investigated in more detail, the predicted drug-pathway associations and growth inhibition patterns are mostly consistent with the experimental data, which also suggests the possibility of identifying genomic markers of drug sensitivity for novel compounds on novel cell lines.
CONTACT: terez@pasteur.fr; ab454@cam.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Isidro Cortés-Ciriano
  - Unité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3825, Structural Biology and Chemistry Department, 75 724 Paris, France
- Gerard J P van Westen
  - Medicinal Chemistry, Leiden Academic Centre for Drug Research, Einsteinweg 55, 2333CC, Leiden
- Guillaume Bouvier
  - Unité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3825, Structural Biology and Chemistry Department, 75 724 Paris, France
- Michael Nilges
  - Unité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3825, Structural Biology and Chemistry Department, 75 724 Paris, France
- John P Overington
  - European Molecular Biology Laboratory European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, UK
- Andreas Bender
  - Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, CB2 1EW Cambridge, UK
- Thérèse E Malliavin
  - Unité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3825, Structural Biology and Chemistry Department, 75 724 Paris, France
88
Dong Z, Zhang N, Li C, Wang H, Fang Y, Wang J, Zheng X. Anticancer drug sensitivity prediction in cell lines from baseline gene expression through recursive feature selection. BMC Cancer 2015; 15:489. [PMID: 26121976 PMCID: PMC4485860 DOI: 10.1186/s12885-015-1492-6] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Received: 09/11/2014] [Accepted: 06/16/2015] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND: An enduring challenge in personalized medicine is to select the right drug for individual patients. Testing drugs on patients in large clinical trials is one way to assess their efficacy and toxicity, but it is impractical to test the hundreds of drugs currently under development. Preclinical prediction models are therefore highly desirable, as they enable prediction of drug response across hundreds of cell lines in parallel.
METHODS: Recently, two large-scale pharmacogenomic studies screened multiple anticancer drugs on over 1000 cell lines in an effort to elucidate the response mechanism of anticancer drugs. Here, we used gene expression features and drug sensitivity data from the Cancer Cell Line Encyclopedia (CCLE) to build a predictor based on a Support Vector Machine (SVM) and a recursive feature selection tool. The robustness of our model was validated by cross-validation and on an independent dataset, the Cancer Genome Project (CGP).
RESULTS: Our model achieved good cross-validation performance for most drugs in the Cancer Cell Line Encyclopedia (≥80% accuracy for 10 drugs, ≥75% accuracy for 19 drugs). Independent tests on eleven drugs common to CCLE and CGP achieved satisfactory performance for three of them, i.e., AZD6244, Erlotinib and PD-0325901, using expression levels of only twelve, six and seven genes, respectively.
CONCLUSIONS: These results suggest that drug response could be effectively predicted from genomic features. Our model could be applied to predict drug response for certain drugs and potentially play a complementary role in personalized medicine.
Affiliation(s)
- Zuoli Dong
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Naiqian Zhang
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Chun Li
  - Department of Mathematics, Bohai University, Jinzhou, China
- Haiyun Wang
  - Department of Bioinformatics, School of Life Science and Technology, Tongji University, Shanghai, China
- Yun Fang
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Jun Wang
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Xiaoqi Zheng
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
89
A network flow-based method to predict anticancer drug sensitivity. PLoS One 2015; 10:e0127380. [PMID: 25992881 PMCID: PMC4436355 DOI: 10.1371/journal.pone.0127380] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 09/22/2014] [Accepted: 04/15/2015] [Indexed: 01/01/2023] Open
Abstract
Predicting anticancer drug sensitivity can enhance the ability to individualize patient treatment, thus making the development of cancer therapies more effective and safe. In this paper, we present a new network flow-based method, which utilizes the topological structure of pathways, for predicting anticancer drug sensitivities. Mutations and copy number alterations of cancer-related genes are assumed to change pathway activity, and the difference in pathway activity before and after drug treatment is used as a measure of drug response. In our model, contributions from different genetic alterations are treated as free parameters, which are optimized using the drug response data from the Cancer Genome Project (CGP). Ten-fold cross-validation on the CGP data set showed that our model achieved prediction results comparable to those of the existing elastic net model while using far fewer input features.
|
90
|
DISIS: prediction of drug response through an iterative sure independence screening. PLoS One 2015; 10:e0120408. [PMID: 25794193 PMCID: PMC4368776 DOI: 10.1371/journal.pone.0120408] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 08/24/2014] [Accepted: 01/21/2015] [Indexed: 02/01/2023] Open
Abstract
Prediction of drug response based on genomic alterations is an important task in personalized medicine research. The current elastic net model uses sure independence screening to select genomic features relevant to drug response, but it may neglect the combined effect of features that are marginally weak. In this work, we applied an iterative sure independence screening scheme to select features relevant to drug response from the Cancer Cell Line Encyclopedia (CCLE) dataset. For each drug in CCLE, we selected up to 40 features, including gene expression, mutations and copy number alterations of cancer-related genes; some of these are strong predictors even though they show weak marginal correlation with the drug response vector. Lasso regression on the selected features showed that our prediction accuracies are higher than those of elastic net regression for most drugs.
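The difference between one-shot and iterative screening described in this abstract can be illustrated with a small sketch. This is not the DISIS implementation: it assumes a simplified greedy variant in which each round picks the feature most correlated with the current residual and then regresses that residual on the chosen feature. The function names (`one_shot_screening`, `iterative_screening`) and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def one_shot_screening(columns, response, n_select):
    """Rank features once by absolute marginal correlation with the response."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs(pearson(columns[j], response)),
                    reverse=True)
    return ranked[:n_select]


def iterative_screening(columns, response, n_select):
    """Simplified iterative variant (an assumption, not the published method):
    select against the residual left after the previously chosen features."""
    residual = list(response)
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(len(columns)) if j not in selected]
        best = max(candidates,
                   key=lambda j: abs(pearson(columns[j], residual)))
        selected.append(best)
        # Remove the part of the residual explained by the chosen feature
        # using a univariate least-squares fit.
        x = columns[best]
        mx = sum(x) / len(x)
        mr = sum(residual) / len(residual)
        slope = (sum((a - mx) * (r - mr) for a, r in zip(x, residual))
                 / sum((a - mx) ** 2 for a in x))
        residual = [r - mr - slope * (a - mx) for a, r in zip(x, residual)]
    return selected


# Toy data (made up): the response is 2*x0 + x1, yet x1 alone is
# uncorrelated with it; x2 is a decoy with moderate marginal correlation.
x0 = [1, 2, 3, 4, 5, 6]
x1 = [1, -1, 1, -1, 1, -1]
x2 = [1, 0, 0, 1, 1, 2]
y = [2 * a + b for a, b in zip(x0, x1)]

one_shot = one_shot_screening([x0, x1, x2], y, n_select=2)
iterative = iterative_screening([x0, x1, x2], y, n_select=2)
```

On this toy data, one-shot screening keeps features 0 and 2 (the decoy), whereas the residual-based variant keeps features 0 and 1, recovering the feature that is marginally weak but informative once the first feature is accounted for.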
|
91
|
Hejase HA, Chan C. Improving Drug Sensitivity Prediction Using Different Types of Data. CPT: Pharmacometrics & Systems Pharmacology 2015. [PMID: 26225231 PMCID: PMC4360670 DOI: 10.1002/psp4.2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/27/2022]
Abstract
The algorithms and models used to address the two subchallenges of the NCI-DREAM (Dialogue for Reverse Engineering Assessments and Methods) Drug Sensitivity Prediction Challenge (2012) are presented. In subchallenge 1, a bidirectional search algorithm is introduced and optimized using an ensemble scheme, and a nonlinear support vector machine (SVM) is then applied to predict the effects of the drug compounds on breast cancer cell lines. In subchallenge 2, a weighted Euclidean distance method is introduced to predict and rank the drug combinations from the most to the least effective in reducing the viability of a diffuse large B-cell lymphoma (DLBCL) cell line.
Affiliation(s)
- H A Hejase
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA
- C Chan
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA; Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, Michigan, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
|
92
|
Riddick G, Song H, Holbeck SL, Kopp W, Walling J, Ahn S, Zhang W, Fine HA. An in silico screen links gene expression signatures to drug response in glioblastoma stem cells. The Pharmacogenomics Journal 2014; 15:347-53. [PMID: 25446780 DOI: 10.1038/tpj.2014.61] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/12/2013] [Revised: 07/29/2014] [Accepted: 08/21/2014] [Indexed: 11/09/2022]
Abstract
Cancer stem cells (CSCs) are thought to promote resistance to chemotherapeutic drugs in glioblastoma, the most common and aggressive primary brain tumor. However, the use of high-throughput drug screens to discover novel small-molecule inhibitors for CSC has been hampered by their instability in long-term cell culture. We asked whether predictive models of drug response could be developed from gene expression signatures of established cell lines and applied to predict drug response in glioblastoma stem cells. Predictions for active compounds were confirmed both for 185 compounds in seven established glioma cell lines and 21 compounds in three glioblastoma stem cells. The use of established cell lines as a surrogate for drug response in CSC lines may enable the large-scale virtual screening of drug candidates that would otherwise be difficult or impossible to test directly in CSCs.
Affiliation(s)
- G Riddick
- Neuro-Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- H Song
- Neuro-Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- S L Holbeck
- Developmental Therapeutics Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- W Kopp
- Developmental Therapeutics Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- J Walling
- Neuro-Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- S Ahn
- Neuro-Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- W Zhang
- Neuro-Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- H A Fine
- NYU Brain Tumor Center, Laura and Isaac Perlmutter Cancer Center, Bellevue Hospital Cancer Center, New York University Langone Medical Center, New York, NY, USA
|
93
|
Berlow N, Davis L, Keller C, Pal R. Inference of dynamic biological networks based on responses to drug perturbations. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:14. [PMID: 28194164 PMCID: PMC5270455 DOI: 10.1186/s13637-014-0014-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/31/2014] [Accepted: 07/21/2014] [Indexed: 12/23/2022]
Abstract
Drugs that target specific proteins are a major paradigm in cancer research. In this article, we extend a modeling framework for drug sensitivity prediction and combination therapy design based on drug perturbation experiments. The recently proposed target inhibition map approach can infer stationary pathway models from drug perturbation experiments, but the method is limited to a steady-state snapshot of the underlying dynamical model. We consider the inverse problem of possible dynamic models that can generate the static target inhibition map model. From a deterministic viewpoint, we analyze the inference of Boolean networks that can generate the observed binarized sensitivities under different target inhibition scenarios. From a stochastic perspective, we investigate the generation of Markov chain models that satisfy the observed target inhibition sensitivities.
Affiliation(s)
- Noah Berlow
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX 79409, USA
- Lara Davis
- Department of Pediatrics, Oregon Health & Science University, Portland, OR 97239, USA
- Charles Keller
- Department of Pediatrics, Oregon Health & Science University, Portland, OR 97239, USA
- Ranadip Pal
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX 79409, USA
|
94
|
Wan Q, Pal R. An ensemble based top performing approach for NCI-DREAM drug sensitivity prediction challenge. PLoS One 2014; 9:e101183. [PMID: 24978814 PMCID: PMC4076307 DOI: 10.1371/journal.pone.0101183] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 04/23/2013] [Accepted: 09/18/2013] [Indexed: 01/12/2023] Open
Abstract
We consider the problem of predicting sensitivity of cancer cell lines to new drugs based on supervised learning on genomic profiles. The genetic and epigenetic characterization of a cell line provides observations on various aspects of regulation including DNA copy number variations, gene expression, DNA methylation and protein abundance. To extract relevant information from the various data types, we applied a random forest based approach to generate sensitivity predictions from each type of data and combined the predictions in a linear regression model to generate the final drug sensitivity prediction. Our approach when applied to the NCI-DREAM drug sensitivity prediction challenge was a top performer among 47 teams and produced high accuracy predictions. Our results show that the incorporation of multiple genomic characterizations lowered the mean and variance of the estimated bootstrap prediction error. We also applied our approach to the Cancer Cell Line Encyclopedia database for sensitivity prediction and the ability to extract the top targets of an anti-cancer drug. The results illustrate the effectiveness of our approach in predicting drug sensitivity from heterogeneous genomic datasets.
Affiliation(s)
- Qian Wan
- Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America
- Ranadip Pal
- Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America
|
95
|
Liu Y, Traskin M, Lorch SA, George EI, Small D. Ensemble of trees approaches to risk adjustment for evaluating a hospital's performance. Health Care Manag Sci 2014; 18:58-66. [PMID: 24777832 DOI: 10.1007/s10729-014-9272-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 09/01/2013] [Accepted: 02/10/2014] [Indexed: 01/07/2023]
Abstract
A commonly used method for evaluating a hospital's performance on an outcome is to compare the hospital's observed outcome rate to the hospital's expected outcome rate given its patient (case) mix and service. The process of calculating the hospital's expected outcome rate given its patient mix and service is called risk adjustment (Iezzoni 1997). Risk adjustment is critical for accurately evaluating and comparing hospitals' performances, since we would not want to unfairly penalize a hospital just because it treats sicker patients. The key to risk adjustment is accurately estimating the probability of an outcome given patient characteristics. For cases with binary outcomes, the method commonly used in risk adjustment is logistic regression. In this paper, we consider ensemble-of-trees methods as alternatives for risk adjustment, including random forests and Bayesian additive regression trees (BART). Both random forests and BART are modern machine learning methods that have recently been shown to have excellent performance for prediction of outcomes in many settings. We apply these methods to carry out risk adjustment for the performance of neonatal intensive care units (NICU). We show that these ensemble-of-trees methods outperform logistic regression in predicting mortality among babies treated in NICU, and provide a superior method of risk adjustment compared to logistic regression.
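The comparison of observed and expected outcome rates described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes the per-patient outcome probabilities have already been produced by some risk model (logistic regression, random forest, BART, or other), and it summarizes the comparison as an observed-to-expected ratio, which is one common choice. The function name and the numbers are hypothetical.

```python
def observed_expected_ratio(outcomes, predicted_probs):
    """Compare a hospital's observed outcome rate with its expected rate.

    outcomes        -- 0/1 outcome for each patient treated at the hospital
    predicted_probs -- model-estimated outcome probability for each patient,
                       given that patient's characteristics (assumed given)
    """
    observed_rate = sum(outcomes) / len(outcomes)
    expected_rate = sum(predicted_probs) / len(predicted_probs)
    return observed_rate, expected_rate, observed_rate / expected_rate


# Made-up example: eight patients, two of whom experienced the outcome.
outcomes = [0, 1, 0, 0, 1, 0, 0, 0]
predicted = [0.05, 0.60, 0.10, 0.05, 0.40, 0.10, 0.20, 0.10]

observed, expected, ratio = observed_expected_ratio(outcomes, predicted)
```

Here the observed rate is 0.25 and the expected rate is about 0.20, so the ratio is about 1.25: the hospital saw more outcomes than its patient mix would predict. A hospital treating sicker patients would have a higher expected rate, which is why the quality of the probability estimates matters.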
Affiliation(s)
- Yang Liu
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
|
96
|
Berlow N, Davis LE, Cantor EL, Séguin B, Keller C, Pal R. A new approach for prediction of tumor sensitivity to targeted drugs based on functional data. BMC Bioinformatics 2013; 14:239. [PMID: 23890326 PMCID: PMC3750584 DOI: 10.1186/1471-2105-14-239] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 11/01/2012] [Accepted: 07/25/2013] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND The success of targeted anti-cancer drugs is frequently hindered by the lack of knowledge of the individual pathway of the patient and by the extreme data requirements for estimating the personalized genetic network of the patient’s tumor. The prediction of tumor sensitivity to targeted drugs remains a major challenge in the design of optimal therapeutic strategies. Current sensitivity prediction approaches are primarily based on genetic characterizations of the tumor sample. We propose a novel sensitivity prediction approach based on functional perturbation data that incorporates drug-protein interaction information and sensitivities to a training set of drugs with known targets. RESULTS We illustrate the high prediction accuracy of our framework on synthetic data generated from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and on an experimental dataset of four canine osteosarcoma tumor cultures following application of 60 targeted small-molecule drugs. We achieve a low leave-one-out cross-validation error of <10% for the canine osteosarcoma tumor cultures using a drug screen consisting of 60 targeted drugs. CONCLUSIONS The proposed framework provides a unique input-output based methodology to model a cancer pathway and predict the effectiveness of targeted anti-cancer drugs. This framework can be developed into a viable approach for personalized cancer therapy.
Affiliation(s)
- Noah Berlow
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA
|
97
|
Bayer I, Groth P, Schneckener S. Prediction errors in learning drug response from gene expression data - influence of labeling, sample size, and machine learning algorithm. PLoS One 2013; 8:e70294. [PMID: 23894636 PMCID: PMC3720898 DOI: 10.1371/journal.pone.0070294] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 10/10/2012] [Accepted: 06/24/2013] [Indexed: 12/24/2022] Open
Abstract
Model-based prediction depends on many choices, ranging from the sample collection and prediction endpoint to the choice of algorithm and its parameters. Here we studied the effects of such choices, exemplified by predicting sensitivity (as IC50) of cancer cell lines towards a variety of compounds. For this, we used three independent sample collections and applied several machine learning algorithms for predicting a variety of endpoints for drug response. We compared all possible models for combinations of sample collection, algorithm, drug, and labeling to an identically generated null model. The predictability of treatment effects varies among compounds, i.e. response could be predicted for some but not for all. The choice of sample collection plays a major role in lowering the prediction error, as does sample size. However, we found that no algorithm was able to consistently outperform the others, and there was no significant difference between regression and two- or three-class predictors in this experimental setting. These results indicate that response-modeling projects should direct efforts mainly towards sample collection and data quality, rather than method adjustment.
Affiliation(s)
- Immanuel Bayer
- Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, Aachen, Germany
- Philip Groth
- Therapeutic Research Group, Bayer Pharma AG, Berlin, Germany
|
98
|
Ren X, Wang Y, Zhang XS, Jin Q. iPcc: a novel feature extraction method for accurate disease class discovery and prediction. Nucleic Acids Res 2013; 41:e143. [PMID: 23761440 PMCID: PMC3737526 DOI: 10.1093/nar/gkt343] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 01/13/2023] Open
Abstract
Gene expression profiling has gradually become a routine procedure for disease diagnosis and classification. In the past decade, many computational methods have been proposed, resulting in great improvements on various levels, including feature selection and algorithms for classification and clustering. In this study, we present iPcc, a novel method from the feature extraction perspective to further propel gene expression profiling technologies from bench to bedside. We define ‘correlation feature space’ for samples based on the gene expression profiles by iterative employment of Pearson’s correlation coefficient. Numerical experiments on both simulated and real gene expression data sets demonstrate that iPcc can greatly highlight the latent patterns underlying noisy gene expression data and thus greatly improve the robustness and accuracy of the algorithms currently available for disease diagnosis and classification based on gene expression profiles.
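The idea of a correlation feature space built by repeatedly applying Pearson's correlation coefficient can be sketched roughly as below. The abstract does not give the exact procedure, so this is an assumption-laden illustration: each sample is re-described by its vector of correlations with all samples, and that mapping is applied a fixed number of times. The function names and the `n_iter` parameter are hypothetical, and the expression values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_features(samples):
    """Represent each sample by its correlations with every sample."""
    return [[pearson(s, t) for t in samples] for s in samples]


def iterated_correlation(samples, n_iter=2):
    """Apply the correlation mapping n_iter times (n_iter is an assumed knob)."""
    features = samples
    for _ in range(n_iter):
        features = correlation_features(features)
    return features


# Toy expression profiles: rows are samples, columns are genes.
# Samples 0 and 1 follow one pattern; samples 2 and 3 follow the opposite one.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.2, 2.9, 4.3],
    [4.0, 3.0, 2.5, 1.0],
    [3.9, 3.1, 2.0, 1.2],
]
features = iterated_correlation(profiles, n_iter=2)
```

In this toy case the iterated features place samples 0 and 1 close together (correlation near 1) and push samples 0 and 2 apart (correlation near -1), which is the kind of sharpening of latent group structure the abstract alludes to.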
Affiliation(s)
- Xianwen Ren
- MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
|
99
|
Joseph S, Karnik S, Nilawe P, Jayaraman VK, Idicula-Thomas S. ClassAMP: a prediction tool for classification of antimicrobial peptides. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1535-1538. [PMID: 22732690 DOI: 10.1109/tcbb.2012.89] [Citation(s) in RCA: 100] [Impact Index Per Article: 7.7] [Indexed: 06/01/2023]
Abstract
Antimicrobial peptides (AMPs) are gaining popularity as anti-infective agents. Information on sequence features that contribute to target specificity of AMPs will aid in accelerating drug discovery programs involving them. In this study, an algorithm called ClassAMP using Random Forests (RFs) and Support Vector Machines (SVMs) has been developed to predict the propensity of a protein sequence to have antibacterial, antifungal, or antiviral activity. ClassAMP is available at http://www.bicnirrh.res.in/classamp/.
Affiliation(s)
- Shaini Joseph
- Biomedical Informatics Center of Indian Council of Medical Research, National Institute for Research in Reproductive Health, Parel, Mumbai, Maharashtra, India.
|
100
|
Zhang W, Niu Y, Xiong Y, Zhao M, Yu R, Liu J. Computational prediction of conformational B-cell epitopes from antigen primary structures by ensemble learning. PLoS One 2012; 7:e43575. [PMID: 22927994 PMCID: PMC3424238 DOI: 10.1371/journal.pone.0043575] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Received: 12/16/2011] [Accepted: 07/26/2012] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Conformational B-cell epitopes are the specific sites on antigens that have immune functions. The identification of conformational B-cell epitopes is of great importance to immunologists for facilitating the design of peptide-based vaccines. To narrow the search for experimental validation, various computational models have been developed for epitope prediction using antigen structures. However, the application of these models is undermined by the limited number of available antigen structures. In contrast to most available structure-based methods, we here attempt to accurately predict conformational B-cell epitopes from antigen sequences. METHODS In this paper, we explore various sequence-derived features that have been observed to be associated with the location of epitopes or have previously been used in similar tasks. These features are evaluated and ranked by their discriminative performance on the benchmark datasets. From the perspective of information science, a combination of various features can usually lead to better results than the individual features. To build a robust model, we adopt an ensemble learning approach to incorporate various features, and develop an ensemble model to predict conformational epitopes from antigen sequences. RESULTS Evaluated by leave-one-out cross-validation, the proposed method gives mean AUC scores of 0.687 and 0.651 on two datasets compiled from the bound structures and unbound structures, respectively. When compared with publicly available servers on the independent dataset, our method yields better or comparable performance. The results demonstrate that the proposed method is useful for sequence-based conformational epitope prediction. AVAILABILITY The web server and datasets are freely available at http://bcell.whu.edu.cn.
Affiliation(s)
- Wen Zhang
- School of Computer, Wuhan University, Wuhan, China.
|