51
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 PMCID: PMC5310525 DOI: 10.12688/f1000research.10529.2] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Accepted: 03/10/2017] [Indexed: 12/19/2022] Open
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data.
Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation.
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
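The precision/recall trade-off described in the Results can be illustrated with a small sketch. The labels and predictions below are invented toy values, not data from the study, and the Matthews correlation coefficient is used here only as one plausible summary of "overall classification performance" (the abstract does not name the metric).

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = sensitive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of predicted-sensitive cell lines that are truly sensitive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of truly sensitive cell lines that the predictor detects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (assumed summary metric, see lead-in)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: 10 cell lines, the first 5 are sensitive to the drug.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A single-gene marker that flags very few cell lines (precise, low recall).
marker_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# A multi-gene classifier that flags more cell lines, with one false positive.
multi_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

for name, pred in [("single-gene marker", marker_pred), ("multi-gene model", multi_pred)]:
    tp, fp, fn, tn = confusion_counts(y_true, pred)
    print(name, precision(tp, fp), recall(tp, fn), round(mcc(tp, fp, fn, tn), 3))
```

In this toy case the marker never mislabels a resistant cell line but misses four of the five sensitive ones, which is the pattern the abstract attributes to single-gene markers.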
Affiliation(s)
- Linh Nguyen
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
- Cuong C Dang
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
- Pedro J Ballester
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
|
52
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 DOI: 10.12688/f1000research.10529.1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Accepted: 12/28/2016] [Indexed: 12/30/2022] Open
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data.
Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation.
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
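The contrast drawn in the Methods between a time-ordered test set and standard k-fold cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `release` field and the `cutoff` parameter are hypothetical names, and the study's actual data partition may be defined differently.

```python
def temporal_split(records, cutoff):
    """Train on measurements from releases up to `cutoff`, test on later ones.

    `records` is a list of dicts with a hypothetical integer `release` field.
    """
    train = [r for r in records if r["release"] <= cutoff]
    test = [r for r in records if r["release"] > cutoff]
    return train, test


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n items.

    Folds are interleaved, so old and new measurements are mixed in every fold,
    which is what hides time-dependent batch effects from the evaluation.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


# Hypothetical toy records: cell line name, data release, log IC50 value.
records = [
    {"cell_line": "A", "release": 1, "log_ic50": -1.2},
    {"cell_line": "B", "release": 1, "log_ic50": 0.4},
    {"cell_line": "C", "release": 1, "log_ic50": 1.1},
    {"cell_line": "D", "release": 2, "log_ic50": -0.3},
    {"cell_line": "E", "release": 2, "log_ic50": 0.9},
    {"cell_line": "F", "release": 2, "log_ic50": -2.0},
]

train, test = temporal_split(records, cutoff=1)
print([r["cell_line"] for r in train], [r["cell_line"] for r in test])

for train_idx, test_idx in k_fold_indices(len(records), k=3):
    print(train_idx, test_idx)
```

With the temporal split, no test measurement comes from the same release as the training data, so a model cannot benefit from batch effects shared between training and test sets.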
Affiliation(s)
- Linh Nguyen
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
- Cuong C Dang
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
- Pedro J Ballester
  - Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
|
53
|
Safikhani Z, El-Hachem N, Smirnov P, Freeman M, Goldenberg A, Birkbak NJ, Beck AH, Aerts HJWL, Quackenbush J, Haibe-Kains B. Safikhani et al. reply. Nature 2016; 540:E2-E4. [PMID: 27905430 DOI: 10.1038/nature19839] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 12/31/2022]
Affiliation(s)
- Zhaleh Safikhani
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario M5G 2M9, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Ontario M5G 1L7, Canada
- Nehme El-Hachem
  - Institut de recherches cliniques de Montréal, Montreal, Quebec H2W 1R7, Canada
- Petr Smirnov
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario M5G 2M9, Canada
- Mark Freeman
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario M5G 2M9, Canada
- Anna Goldenberg
  - Hospital for Sick Children, Toronto, Ontario M5G 1X8, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Nicolai J Birkbak
  - The Francis Crick Institute, University College London, London NW1 1AT, UK
  - University College London Cancer Institute, London, WC1E 6BT, UK
- Andrew H Beck
  - Beth Israel Deaconess Medical Center, Boston, Massachusetts 02215, USA
  - Harvard Medical School, Boston, Massachusetts 02115, USA
- Hugo J W L Aerts
  - Harvard Medical School, Boston, Massachusetts 02115, USA
  - Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
  - Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
- John Quackenbush
  - Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
  - Harvard School of Public Health, Boston, Massachusetts 02115, USA
- Benjamin Haibe-Kains
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario M5G 2M9, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Ontario M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Ontario Institute of Cancer Research, Toronto, Ontario M5G 1L7, Canada
|
54
|
Safikhani Z, Smirnov P, Freeman M, El-Hachem N, She A, Rene Q, Goldenberg A, Birkbak NJ, Hatzis C, Shi L, Beck AH, Aerts HJ, Quackenbush J, Haibe-Kains B. Revisiting inconsistency in large pharmacogenomic studies. F1000Res 2016; 5:2333. [PMID: 28928933 PMCID: PMC5580432 DOI: 10.12688/f1000research.9611.3] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Accepted: 08/11/2017] [Indexed: 01/30/2023] Open
Abstract
In 2013, we published a comparative analysis of mutation and gene expression profiles and drug sensitivity measurements for 15 drugs characterized in the 471 cancer cell lines screened in the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). While we found good concordance in gene expression profiles, there was substantial inconsistency in the drug responses reported by the GDSC and CCLE projects. We received extensive feedback on the comparisons that we performed. This feedback, along with the release of new data, prompted us to revisit our initial analysis. We present a new analysis using these expanded data, where we address the most significant suggestions for improvements on our published analysis - that targeted therapies and broad cytotoxic drugs should have been treated differently in assessing consistency, that consistency of both molecular profiles and drug sensitivity measurements should be compared across cell lines, and that the software analysis tools provided should have been easier to run, particularly as the GDSC and CCLE released additional data. Our re-analysis supports our previous finding that gene expression data are significantly more consistent than drug sensitivity measurements. Using new statistics to assess data consistency allowed identification of two broad effect drugs and three targeted drugs with moderate to good consistency in drug sensitivity data between GDSC and CCLE. For three other targeted drugs, there were not enough sensitive cell lines to assess the consistency of the pharmacological profiles. We found evidence of inconsistencies in pharmacological phenotypes for the remaining eight drugs. Overall, our findings suggest that the drug sensitivity data in GDSC and CCLE continue to present challenges for robust biomarker discovery. 
This re-analysis provides additional support for the argument that experimental standardization and validation of pharmacogenomic response will be necessary to advance the broad use of large pharmacogenomic screens.
Affiliation(s)
- Zhaleh Safikhani
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Petr Smirnov
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Mark Freeman
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Nehme El-Hachem
  - Institut de Recherches Cliniques de Montréal, Montréal, H2W 1R7, Canada
- Adrian She
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Quevedo Rene
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Anna Goldenberg
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Hospital for Sick Children, Toronto, M5G 1X8, Canada
- Christos Hatzis
  - Yale Cancer Center, Yale University, New Haven, CT, 06510, USA
  - Section of Medical Oncology, Yale University School of Medicine, New Haven, CT, 06520, USA
- Leming Shi
  - University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
  - Fudan University, Shanghai City, 200135, China
- Andrew H. Beck
  - Department of Pathology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, 02215, USA
- Hugo J.W.L. Aerts
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Radiation Oncology and Radiology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, 02215, USA
- John Quackenbush
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
- Benjamin Haibe-Kains
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Ontario Institute of Cancer Research, Toronto, M5G 1L7, Canada
|
55
|
Safikhani Z, Smirnov P, Freeman M, El-Hachem N, She A, Rene Q, Goldenberg A, Birkbak NJ, Hatzis C, Shi L, Beck AH, Aerts HJ, Quackenbush J, Haibe-Kains B. Revisiting inconsistency in large pharmacogenomic studies. F1000Res 2016; 5:2333. [PMID: 28928933 PMCID: PMC5580432 DOI: 10.12688/f1000research.9611.2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Accepted: 07/21/2017] [Indexed: 11/13/2023] Open
Abstract
In 2013, we published a comparative analysis of mutation and gene expression profiles and drug sensitivity measurements for 15 drugs characterized in the 471 cancer cell lines screened in the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). While we found good concordance in gene expression profiles, there was substantial inconsistency in the drug responses reported by the GDSC and CCLE projects. We received extensive feedback on the comparisons that we performed. This feedback, along with the release of new data, prompted us to revisit our initial analysis. We present a new analysis using these expanded data, where we address the most significant suggestions for improvements on our published analysis - that targeted therapies and broad cytotoxic drugs should have been treated differently in assessing consistency, that consistency of both molecular profiles and drug sensitivity measurements should be compared across cell lines, and that the software analysis tools provided should have been easier to run, particularly as the GDSC and CCLE released additional data. Our re-analysis supports our previous finding that gene expression data are significantly more consistent than drug sensitivity measurements. Using new statistics to assess data consistency allowed identification of two broad effect drugs and three targeted drugs with moderate to good consistency in drug sensitivity data between GDSC and CCLE. For three other targeted drugs, there were not enough sensitive cell lines to assess the consistency of the pharmacological profiles. We found evidence of inconsistencies in pharmacological phenotypes for the remaining eight drugs. Overall, our findings suggest that the drug sensitivity data in GDSC and CCLE continue to present challenges for robust biomarker discovery. 
This re-analysis provides additional support for the argument that experimental standardization and validation of pharmacogenomic response will be necessary to advance the broad use of large pharmacogenomic screens.
Affiliation(s)
- Zhaleh Safikhani
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Petr Smirnov
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Mark Freeman
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Nehme El-Hachem
  - Institut de Recherches Cliniques de Montréal, Montréal, H2W 1R7, Canada
- Adrian She
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Quevedo Rene
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Anna Goldenberg
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Hospital for Sick Children, Toronto, M5G 1X8, Canada
- Christos Hatzis
  - Yale Cancer Center, Yale University, New Haven, CT, 06510, USA
  - Section of Medical Oncology, Yale University School of Medicine, New Haven, CT, 06520, USA
- Leming Shi
  - University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
  - Fudan University, Shanghai City, 200135, China
- Andrew H. Beck
  - Department of Pathology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, 02215, USA
- Hugo J.W.L. Aerts
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Radiation Oncology and Radiology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, 02215, USA
- John Quackenbush
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
- Benjamin Haibe-Kains
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Ontario Institute of Cancer Research, Toronto, M5G 1L7, Canada
|
56
|
Safikhani Z, Smirnov P, Freeman M, El-Hachem N, She A, Rene Q, Goldenberg A, Birkbak NJ, Hatzis C, Shi L, Beck AH, Aerts HJ, Quackenbush J, Haibe-Kains B. Revisiting inconsistency in large pharmacogenomic studies. F1000Res 2016; 5:2333. [PMID: 28928933 PMCID: PMC5580432 DOI: 10.12688/f1000research.9611.1] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Accepted: 09/15/2016] [Indexed: 01/22/2023] Open
Abstract
In 2013, we published a comparative analysis of mutation and gene expression profiles and drug sensitivity measurements for 15 drugs characterized in the 471 cancer cell lines screened in the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). While we found good concordance in gene expression profiles, there was substantial inconsistency in the drug responses reported by the GDSC and CCLE projects. We received extensive feedback on the comparisons that we performed. This feedback, along with the release of new data, prompted us to revisit our initial analysis. Here we present a new analysis using these expanded data in which we address the most significant suggestions for improvements on our published analysis - that targeted therapies and broad cytotoxic drugs should have been treated differently in assessing consistency, that consistency of both molecular profiles and drug sensitivity measurements should be compared across cell lines, and that the software analysis tools we provided should have been easier to run, particularly as the GDSC and CCLE released additional data. Our re-analysis supports our previous finding that gene expression data are significantly more consistent than drug sensitivity measurements. The use of new statistics to assess data consistency allowed us to identify two broad effect drugs and three targeted drugs with moderate to good consistency in drug sensitivity data between GDSC and CCLE. For three other targeted drugs, there were not enough sensitive cell lines to assess the consistency of the pharmacological profiles. We found evidence of inconsistencies in pharmacological phenotypes for the remaining eight drugs. Overall, our findings suggest that the drug sensitivity data in GDSC and CCLE continue to present challenges for robust biomarker discovery. 
This re-analysis provides additional support for the argument that experimental standardization and validation of pharmacogenomic response will be necessary to advance the broad use of large pharmacogenomic screens.
Affiliation(s)
- Zhaleh Safikhani
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Petr Smirnov
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Mark Freeman
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Nehme El-Hachem
  - Institut de Recherches Cliniques de Montréal, Montréal, H2W 1R7, Canada
- Adrian She
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Quevedo Rene
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
- Anna Goldenberg
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Hospital for Sick Children, Toronto, M5G 1X8, Canada
- Christos Hatzis
  - Yale Cancer Center, Yale University, New Haven, CT, 06510, USA
  - Section of Medical Oncology, Yale University School of Medicine, New Haven, CT, 06520, USA
- Leming Shi
  - University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
  - Fudan University, Shanghai City, 200135, China
- Andrew H. Beck
  - Department of Pathology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, 02215, USA
- Hugo J.W.L. Aerts
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Radiation Oncology and Radiology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, 02215, USA
- John Quackenbush
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Boston, MA, 02215, USA
  - Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
- Benjamin Haibe-Kains
  - Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
  - Ontario Institute of Cancer Research, Toronto, M5G 1L7, Canada
|
57
|
Ammad-ud-din M, Khan SA, Malani D, Murumägi A, Kallioniemi O, Aittokallio T, Kaski S. Drug response prediction by inferring pathway-response associations with kernelized Bayesian matrix factorization. Bioinformatics 2016; 32:i455-i463. [DOI: 10.1093/bioinformatics/btw433] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Indexed: 11/14/2022] Open
|
58
|
Cortes-Ciriano I. Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR. J Chem Inf Model 2016; 56:1576-87. [DOI: 10.1021/acs.jcim.6b00136] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 01/29/2023]
Affiliation(s)
- Isidro Cortes-Ciriano
  - Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale, CNRS UMR 3825, 25, rue du Dr Roux, 75015 Paris, France
|
59
|
Ding Z, Zu S, Gu J. Evaluating the molecule-based prediction of clinical drug responses in cancer. Bioinformatics 2016; 32:2891-5. [DOI: 10.1093/bioinformatics/btw344] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Received: 03/01/2016] [Accepted: 05/26/2016] [Indexed: 01/09/2023] Open
|
60
|
Cortes-Ciriano I, Bender A. Improved Chemical Structure-Activity Modeling Through Data Augmentation. J Chem Inf Model 2015; 55:2682-92. [PMID: 26619900 DOI: 10.1021/acs.jcim.5b00570] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/28/2022]
Abstract
Extending the original training data with simulated unobserved data points has proven powerful to increase both the generalization ability of predictive models and their robustness against changes in the structure of data (e.g., systematic drifts in the response variable) in diverse areas such as the analysis of spectroscopic data or the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation in the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) on either (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation constantly led to increased predictive power on the test set by 10-15%. Injecting noise on (i) compound descriptors or on (ii) both compound descriptors and pIC50 values led to the highest drop of RMSEtest values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data is replicated one time. 
Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely increase of (i) model complexity due to the need for optimizing σnoise and (ii) the number of training examples.
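The augmentation scheme described in this abstract (replicating the training set N times and perturbing the copies with zero-mean Gaussian noise) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `augment`, its parameters and the toy values are assumptions, and a single `sigma` is applied to both descriptors and responses for simplicity.

```python
import random


def augment(X, y, n_copies=1, sigma=0.1, noise_on="both", seed=0):
    """Return the original data plus `n_copies` noisy replicates.

    X: list of descriptor rows (lists of floats); y: list of responses (e.g. pIC50).
    noise_on: "response", "descriptors", "both" or "none", mirroring options (i)-(iv).
    sigma: standard deviation of the zero-mean Gaussian noise (hypothetical default).
    """
    if noise_on not in {"response", "descriptors", "both", "none"}:
        raise ValueError("unknown noise_on option: " + noise_on)
    rng = random.Random(seed)
    X_aug = [list(row) for row in X]  # keep the unperturbed originals
    y_aug = list(y)
    for _ in range(n_copies):
        for row, target in zip(X, y):
            new_row = list(row)
            new_target = target
            if noise_on in ("descriptors", "both"):
                new_row = [v + rng.gauss(0.0, sigma) for v in row]
            if noise_on in ("response", "both"):
                new_target = target + rng.gauss(0.0, sigma)
            X_aug.append(new_row)
            y_aug.append(new_target)
    return X_aug, y_aug


# Hypothetical toy training set: two compounds, two descriptors each.
X_train = [[1.0, 2.0], [3.0, 4.0]]
y_train = [5.0, 6.0]

X_aug, y_aug = augment(X_train, y_train, n_copies=1, sigma=0.1, noise_on="descriptors")
print(len(X_aug), len(y_aug))  # 4 4
```

With `n_copies=1` the training set doubles in size, which corresponds to the single perturbed repetition that the abstract identifies as a reasonable trade-off.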
Affiliation(s)
- Isidro Cortes-Ciriano
  - Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825, 25, rue du Dr Roux, 75015 Paris, France
- Andreas Bender
  - Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
|
61
|
Cortés-Ciriano I, Bender A. How Consistent are Publicly Reported Cytotoxicity Data? Large-Scale Statistical Analysis of the Concordance of Public Independent Cytotoxicity Measurements. ChemMedChem 2015; 11:57-71. [DOI: 10.1002/cmdc.201500424] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 09/15/2015] [Indexed: 12/13/2022]
Affiliation(s)
- Isidro Cortés-Ciriano
  - Institut Pasteur; Unité de Bioinformatique Structurale; CNRS UMR 3825; Département de Biologie Structurale et Chimie; 25, rue du Dr. Roux 75015 Paris France
- Andreas Bender
  - Centre for Molecular Science Informatics; Department of Chemistry; University of Cambridge; Lensfield Road Cambridge CB2 1EW UK
|