51
|
Smoller JW, Andreassen OA, Edenberg HJ, Faraone SV, Glatt SJ, Kendler KS. Psychiatric genetics and the structure of psychopathology. Mol Psychiatry 2019; 24:409-420. [PMID: 29317742 PMCID: PMC6684352 DOI: 10.1038/s41380-017-0010-4] [Citation(s) in RCA: 215] [Impact Index Per Article: 43.0] [Received: 07/20/2017] [Revised: 10/23/2017] [Accepted: 11/01/2017] [Indexed: 12/20/2022]
Abstract
For over a century, psychiatric disorders have been defined by expert opinion and clinical observation. The modern DSM has relied on a consensus of experts to define categorical syndromes based on clusters of symptoms and signs, and, to some extent, external validators, such as longitudinal course and response to treatment. In the absence of an established etiology, psychiatry has struggled to validate these descriptive syndromes, and to define the boundaries between disorders and between normal and pathologic variation. Recent advances in genomic research, coupled with large-scale collaborative efforts like the Psychiatric Genomics Consortium, have identified hundreds of common and rare genetic variations that contribute to a range of neuropsychiatric disorders. At the same time, they have begun to address deeper questions about the structure and classification of mental disorders: To what extent do genetic findings support or challenge our clinical nosology? Are there genetic boundaries between psychiatric and neurologic illness? Do the data support a boundary between disorder and normal variation? Is it possible to envision a nosology based on genetically informed disease mechanisms? This review provides an overview of conceptual issues and genetic findings that bear on the relationships among and boundaries between psychiatric disorders and other conditions. We highlight implications for the evolving classification of psychopathology and the challenges for clinical translation.
Affiliation(s)
- Jordan W Smoller
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
  - Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Ole A Andreassen
  - NORMENT-KG Jebsen Centre, University of Oslo, Oslo, Norway
  - Division of Mental Health and Addiction, Oslo University Hospital, Oslo, Norway
- Howard J Edenberg
  - Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN, USA
- Stephen V Faraone
  - Departments of Psychiatry and of Neuroscience and Physiology, SUNY Upstate Medical University, Syracuse, NY, USA
- Stephen J Glatt
  - Departments of Psychiatry and of Neuroscience and Physiology, SUNY Upstate Medical University, Syracuse, NY, USA
- Kenneth S Kendler
  - Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
|
52
|
Efficient Implementation of Penalized Regression for Genetic Risk Prediction. Genetics 2019; 212:65-74. [PMID: 30808621 PMCID: PMC6499521 DOI: 10.1534/genetics.119.302019] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 10/11/2018] [Accepted: 02/22/2019] [Indexed: 12/14/2022] Open
Abstract
Polygenic risk scores (PRS) combine many single-nucleotide polymorphisms into a score reflecting the genetic risk of developing a disease. Privé, Aschard, and Blum present an efficient implementation of penalized logistic regression... Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. 
We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC values of 89% and of 82.5%. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in a few minutes only. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
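As a rough illustration of what a polygenic risk score is, the sketch below computes a score as a weighted sum of risk-allele counts and applies only the p-value thresholding step of C+T. The SNP names, effect sizes, p-values, and function names are hypothetical; clumping and the joint penalized-regression estimation described in the abstract are not implemented here.

```python
from typing import Dict


def threshold_effects(effects: Dict[str, float],
                      p_values: Dict[str, float],
                      p_max: float) -> Dict[str, float]:
    """Keep only SNPs whose GWAS p-value is at most p_max (the thresholding step)."""
    return {snp: beta for snp, beta in effects.items() if p_values[snp] <= p_max}


def polygenic_risk_score(genotypes: Dict[str, int],
                         effects: Dict[str, float]) -> float:
    """Weighted sum of risk-allele counts (0, 1 or 2) over the SNPs in `effects`."""
    return sum(beta * genotypes.get(snp, 0) for snp, beta in effects.items())


# Hypothetical per-SNP effect sizes (e.g. log odds ratios) and GWAS p-values.
effects = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
p_values = {"rs1": 1e-9, "rs2": 1e-3, "rs3": 0.2}

# Hypothetical individual: number of risk alleles carried at each SNP.
person = {"rs1": 2, "rs2": 1, "rs3": 1}

kept = threshold_effects(effects, p_values, p_max=0.01)
score_all = polygenic_risk_score(person, effects)         # uses all three SNPs
score_thresholded = polygenic_risk_score(person, kept)    # uses rs1 and rs2 only
```

In this toy setting the only difference between the two scores is the contribution of the SNP that fails the p-value threshold; a joint method would instead re-estimate all effect sizes together.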
|
53
|
|
54
|
Waldmann P. Approximate Bayesian neural networks in genomic prediction. Genet Sel Evol 2018; 50:70. [PMID: 30577737 PMCID: PMC6303864 DOI: 10.1186/s12711-018-0439-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 08/06/2018] [Accepted: 12/16/2018] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Genome-wide marker data are used both in phenotypic genome-wide association studies (GWAS) and genome-wide prediction (GWP). Typically, such studies include high-dimensional data with thousands to millions of single nucleotide polymorphisms (SNPs) recorded in hundreds to a few thousand individuals. Different machine-learning approaches have been used in GWAS and GWP effectively, but the use of neural networks (NN) and deep learning is still scarce. This study presents an NN model for genomic SNP data. RESULTS We show, using both simulated and real pig data, that regularization is obtained using weight decay and dropout, and results in an approximate Bayesian (ABNN) model that can be used to obtain model averaged posterior predictions. The ABNN model is implemented in mxnet and shown to yield better prediction accuracy than genomic best linear unbiased prediction and Bayesian LASSO. The mean squared error was reduced by at least 6.5% in the simulated data and by at least 1% in the real data. Moreover, by comparing NN of different complexities, our results confirm that a shallow model with one layer, one neuron, one-hot encoding and a linear activation function performs better than more complex models. CONCLUSIONS The ABNN model provides a computationally efficient approach with good prediction performance and in which the weight components can also provide information on the importance of the SNPs. Hence, ABNN is suitable for both GWP and GWAS.
Affiliation(s)
- Patrik Waldmann
  - Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences (SLU), Box 7023, 750 07, Uppsala, Sweden.
|
55
|
Improving pharmacogenetic prediction of extrapyramidal symptoms induced by antipsychotics. Transl Psychiatry 2018; 8:276. [PMID: 30546092 PMCID: PMC6293322 DOI: 10.1038/s41398-018-0330-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 05/22/2018] [Revised: 10/15/2018] [Accepted: 11/13/2018] [Indexed: 11/30/2022] Open
Abstract
In previous work we developed a pharmacogenetic predictor of antipsychotic (AP) induced extrapyramidal symptoms (EPS) based on four genes involved in mTOR regulation. The main objective is to improve this predictor by increasing its biological plausibility and replication. We re-sequence the four genes using next-generation sequencing. We predict functionality "in silico" of all identified SNPs and test it using gene reporter assays. Using functional SNPs, we develop a new predictor utilizing machine learning algorithms (Discovery Cohort, N = 131) and replicate it in two independent cohorts (Replication Cohort 1, N = 113; Replication Cohort 2, N = 113). After prioritization, four SNPs were used to develop the pharmacogenetic predictor of AP-induced EPS. The model constructed using the Naive Bayes algorithm achieved 66% accuracy in the Discovery Cohort, and similar performance in the replication cohorts. The result is an improved pharmacogenetic predictor of AP-induced EPS, which is more robust and generalizable than the original.
|
56
|
Dorani F, Hu T, Woods MO, Zhai G. Ensemble learning for detecting gene-gene interactions in colorectal cancer. PeerJ 2018; 6:e5854. [PMID: 30397551 PMCID: PMC6211269 DOI: 10.7717/peerj.5854] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 06/05/2018] [Accepted: 09/28/2018] [Indexed: 11/20/2022] Open
Abstract
Colorectal cancer (CRC) has a high incidence rate in both men and women and affects millions of people every year. Genome-wide association studies (GWAS) on CRC have successfully revealed common single-nucleotide polymorphisms (SNPs) associated with CRC risk. However, they can only explain a very limited fraction of the disease heritability. One reason may be the common uni-variable analyses in GWAS where genetic variants are examined one at a time. Given the complexity of cancers, the non-additive interaction effects among multiple genetic variants have the potential to explain the missing heritability. In this study, we employed two powerful ensemble learning algorithms, random forests and gradient boosting machine (GBM), to search for SNPs that contribute to the disease risk through non-additive gene-gene interactions. We were able to find 44 possible susceptibility SNPs that were ranked most significant by both algorithms. Out of those 44 SNPs, 29 are in coding regions. The 29 genes include ARRDC5, DCC, ALK, and ITGA1, which have previously been found to be associated with CRC, and E2F3 and NID2, which are potentially related to CRC since they have known associations with other types of cancer. We performed pairwise and three-way interaction analysis on the 44 SNPs using information theoretical techniques and found 17 pairwise (p < 0.02) and 16 three-way (p ≤ 0.001) interactions among them. Moreover, functional enrichment analysis suggested 16 functional terms or biological pathways that may help us better understand the etiology of the disease.
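The idea of keeping SNPs that are ranked highly by both algorithms can be sketched as an intersection of two top-k lists. This is only an illustration: the importance values, SNP names, cutoff k, and the helper `top_k` are hypothetical, and the study's actual ranking criterion may differ.

```python
from typing import Dict, Set


def top_k(importances: Dict[str, float], k: int) -> Set[str]:
    """Return the k SNPs with the largest importance scores."""
    ranked = sorted(importances, key=lambda snp: importances[snp], reverse=True)
    return set(ranked[:k])


# Hypothetical variable-importance scores from two ensemble learners.
rf_importance = {"snpA": 0.30, "snpB": 0.25, "snpC": 0.05, "snpD": 0.20}
gbm_importance = {"snpA": 0.10, "snpB": 0.40, "snpC": 0.35, "snpD": 0.15}

# SNPs that appear in the top 3 of both rankings.
shared = top_k(rf_importance, 3) & top_k(gbm_importance, 3)
```

Requiring agreement between two differently constructed learners is a conservative filter: a SNP favored by only one model is dropped.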
Affiliation(s)
- Faramarz Dorani
  - Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Ting Hu
  - Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Michael O Woods
  - Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Guangju Zhai
  - Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
|
57
|
Genomic prediction of relapse in recipients of allogeneic haematopoietic stem cell transplantation. Leukemia 2018; 33:240-248. [PMID: 30089915 PMCID: PMC6326954 DOI: 10.1038/s41375-018-0229-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 12/15/2017] [Revised: 06/21/2018] [Accepted: 07/17/2018] [Indexed: 02/06/2023]
Abstract
Allogeneic haematopoietic stem cell transplantation currently represents the primary potentially curative treatment for cancers of the blood and bone marrow. While relapse occurs in approximately 30% of patients, few risk-modifying genetic variants have been identified. The present study evaluates the predictive potential of patient genetics on relapse risk in a genome-wide manner. We studied 151 graft recipients with HLA-matched sibling donors by sequencing the whole-exome, active immunoregulatory regions, and the full MHC region. To assess the predictive capability and contributions of SNPs and INDELs, we employed machine learning and a feature selection approach in a cross-validation framework to discover the most informative variants while controlling against overfitting. Our results show that germline genetic polymorphisms in patients entail a significant contribution to relapse risk, as judged by the predictive performance of the model (AUC = 0.72 [95% CI: 0.63-0.81]). Furthermore, the top contributing variants were predictive in two independent replication cohorts (n = 258 and n = 125) from the same population. The results can help elucidate relapse mechanisms and suggest novel therapeutic targets. A computational genomic model could provide a step toward individualized prognostic risk assessment, particularly when accompanied by other data modalities.
|
58
|
Zheutlin AB, Chekroud AM, Polimanti R, Gelernter J, Sabb FW, Bilder RM, Freimer N, London ED, Hultman CM, Cannon TD. Multivariate Pattern Analysis of Genotype-Phenotype Relationships in Schizophrenia. Schizophr Bull 2018; 44. [PMID: 29534239 PMCID: PMC6101611 DOI: 10.1093/schbul/sby005] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 01/26/2023]
Abstract
Genetic risk variants for schizophrenia have been linked to many related clinical and biological phenotypes with the hopes of delineating how individual variation across thousands of variants corresponds to the clinical and etiologic heterogeneity within schizophrenia. This has primarily been done using risk score profiling, which aggregates effects across all variants into a single predictor. While effective, this method lacks flexibility in certain domains: risk scores cannot capture nonlinear effects and do not employ any variable selection. We used random forest, an algorithm with this flexibility designed to maximize predictive power, to predict 6 cognitive endophenotypes in a combined sample of psychiatric patients and controls (N = 739) using 77 genetic variants strongly associated with schizophrenia. Tenfold cross-validation was applied to the discovery sample and models were externally validated in an independent sample of similar ancestry (N = 336). Linear approaches, including linear regression and task-specific polygenic risk scores, were employed for comparison. Random forest models for processing speed (P = .019) and visual memory (P = .036) and risk scores developed for verbal (P = .042) and working memory (P = .037) successfully generalized to an independent sample with similar predictive strength and error. As such, we suggest that both methods may be useful for mapping a limited set of predetermined, disease-associated SNPs to related phenotypes. Incorporating random forest and other more flexible algorithms into genotype-phenotype mapping inquiries could contribute to parsing heterogeneity within schizophrenia; such algorithms can perform as well as standard methods and can capture a more comprehensive set of potential relationships.
Affiliation(s)
- Adam M Chekroud
  - Department of Psychology, Yale University, New Haven, CT
  - Spring Health, New York, NY
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT
- Renato Polimanti
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT
- Joel Gelernter
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT
- Fred W Sabb
  - Lewis Center for Neuroimaging, University of Oregon, Eugene, OR
- Robert M Bilder
  - Department of Psychology, University of California - Los Angeles, Los Angeles, CA
- Nelson Freimer
  - Department of Psychiatry and Biobehavioral Sciences, University of California - Los Angeles, Los Angeles, CA
- Edythe D London
  - Department of Psychiatry and Biobehavioral Sciences, University of California - Los Angeles, Los Angeles, CA
- Christina M Hultman
  - Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Tyrone D Cannon
  - Department of Psychology, Yale University, New Haven, CT
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT
  - To whom correspondence should be addressed; Department of Psychology, Yale University, PO Box 208205, New Haven, CT 06520; tel: 203-436-1545, e-mail:
|
59
|
Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods. Front Genet 2018; 9:237. [PMID: 30023001 PMCID: PMC6039760 DOI: 10.3389/fgene.2018.00237] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 03/25/2018] [Accepted: 06/14/2018] [Indexed: 12/22/2022] Open
Abstract
The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has never been explored before. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) the identification of top 400, 1,000, and 3,000 ranked SNPs; (b) using the subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) from all the SNPs. We found that RF and especially GBM are efficient methods in identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. In comparison to the estimate of prediction accuracy of GEBVs from using all SNPs (0.43), the 3,000 top SNPs identified by RF (0.42) and GBM (0.46) had similar values to those of the whole SNP panel. The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29).
Of the three methods, RF and GBM consistently outperformed the XgBoost in genomic prediction accuracy.
Affiliation(s)
- Bo Li
  - CSIRO Agriculture and Food, St Lucia, QLD, Australia
  - Shandong Technology and Business University, School of Computer Science and Technology, YanTai, China
  - Shandong Co-Innovation Centre of Future Intelligent Computing, YanTai, China
- Nanxi Zhang
  - Centre for Applications in Natural Resource Mathematics, University of Queensland, St Lucia, QLD, Australia
- You-Gan Wang
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
- Yutao Li
  - CSIRO Agriculture and Food, St Lucia, QLD, Australia
|
60
|
Choi J, Kim K, Sun H. New variable selection strategy for analysis of high-dimensional DNA methylation data. J Bioinform Comput Biol 2018; 16:1850010. [PMID: 29954287 DOI: 10.1142/s0219720018500105] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
In genetic association studies, regularization methods are often used due to their computational efficiency for analysis of high-dimensional genomic data. DNA methylation data generated from the Infinium HumanMethylation450 BeadChip Kit have a group structure where an individual gene consists of multiple Cytosine-phosphate-Guanine (CpG) sites. Consequently, group-based regularization can precisely detect outcome-related CpG sites. Representative examples are sparse group lasso (SGL) and network-based regularization. The former is powerful when most of the CpG sites within the same gene are associated with a phenotype outcome. In contrast, the latter is preferred when only a few of the CpG sites within the same gene are related to the outcome. In this paper, we propose a new variable selection strategy based on a selection probability that measures the selection frequency of individual variables selected by both SGL and network-based regularization. In an extensive simulation study, we demonstrated that the proposed strategy achieves better selection performance than both SGL and network-based regularization in all settings considered. We also applied the proposed strategy to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.
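One way to read the selection-probability idea is as a selection frequency pooled over repeated fits of the two methods. The sketch below assumes exactly that reading (the abstract does not give the formula): each method is fitted on several resamples, and a variable's probability is the fraction of all fits in which it was selected. The CpG names, the selected sets, and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def selection_probability(selections_a: Iterable[Set[str]],
                          selections_b: Iterable[Set[str]]) -> Dict[str, float]:
    """Fraction of all fits (both methods pooled) in which each variable was selected."""
    fits = list(selections_a) + list(selections_b)
    counts = Counter(var for selected in fits for var in set(selected))
    return {var: n / len(fits) for var, n in counts.items()}


# Hypothetical sets of CpG sites selected on two resamples by each method.
sgl_runs = [{"cg1", "cg2"}, {"cg1"}]
network_runs = [{"cg1", "cg3"}, {"cg1", "cg2"}]

probs = selection_probability(sgl_runs, network_runs)
```

Variables could then be ranked by `probs` and kept above some cutoff; the cutoff itself would be a further assumption.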
Affiliation(s)
- Jiyun Choi
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan 46241, Korea
|
61
|
Kang J, Rancati T, Lee S, Oh JH, Kerns SL, Scott JG, Schwartz R, Kim S, Rosenstein BS. Machine Learning and Radiogenomics: Lessons Learned and Future Directions. Front Oncol 2018; 8:228. [PMID: 29977864 PMCID: PMC6021505 DOI: 10.3389/fonc.2018.00228] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 02/16/2018] [Accepted: 06/04/2018] [Indexed: 12/25/2022] Open
Abstract
Due to the rapid increase in the availability of patient data, there is significant interest in precision medicine that could facilitate the development of a personalized treatment plan for each patient on an individual basis. Radiation oncology is particularly suited for predictive machine learning (ML) models due to the enormous amount of diagnostic data used as input and therapeutic data generated as output. An emerging field in precision radiation oncology that can take advantage of ML approaches is radiogenomics, which is the study of the impact of genomic variations on the sensitivity of normal and tumor tissue to radiation. Currently, patients undergoing radiotherapy are treated using uniform dose constraints specific to the tumor and surrounding normal tissues. This is suboptimal in many ways. First, the dose that can be delivered to the target volume may be insufficient for control but is constrained by the surrounding normal tissue, as dose escalation can lead to significant morbidity and rare fatalities. Second, two patients with nearly identical dose distributions can have substantially different acute and late toxicities, resulting in lengthy treatment breaks and suboptimal control, or chronic morbidities leading to poor quality of life. Despite significant advances in radiogenomics, the magnitude of the genetic contribution to radiation response far exceeds our current understanding of individual risk variants. In the field of genomics, ML methods are being used to extract harder-to-detect knowledge, but these methods have yet to fully penetrate radiogenomics. Hence, the goal of this publication is to provide an overview of ML as it applies to radiogenomics. We begin with a brief history of radiogenomics and its relationship to precision medicine. We then introduce ML and compare it to statistical hypothesis testing to reflect on shared lessons and to avoid common pitfalls. Current ML approaches to genome-wide association studies are examined.
The application of ML specifically to radiogenomics is next presented. We end with important lessons for the proper integration of ML into radiogenomics.
Affiliation(s)
- John Kang
  - Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Tiziana Rancati
  - Prostate Cancer Program, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Sangkyu Lee
  - Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Jung Hun Oh
  - Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Sarah L. Kerns
  - Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Jacob G. Scott
  - Department of Translational Hematology and Oncology Research, Cleveland Clinic, Cleveland, OH, United States
  - Department of Radiation Oncology, Cleveland Clinic, Cleveland, OH, United States
- Russell Schwartz
  - Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States
- Seyoung Kim
  - Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
- Barry S. Rosenstein
  - Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, United States
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
|
62
|
Maciukiewicz M, Marshe VS, Hauschild AC, Foster JA, Rotzinger S, Kennedy JL, Kennedy SH, Müller DJ, Geraci J. GWAS-based machine learning approach to predict duloxetine response in major depressive disorder. J Psychiatr Res 2018; 99:62-68. [PMID: 29407288 DOI: 10.1016/j.jpsychires.2017.12.009] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 06/19/2017] [Revised: 10/31/2017] [Accepted: 12/14/2017] [Indexed: 12/22/2022]
Abstract
Major depressive disorder (MDD) is one of the most prevalent psychiatric disorders and is commonly treated with antidepressant drugs. However, large variability is observed in terms of response to antidepressants. Machine learning (ML) models may be useful to predict treatment outcomes. A sample of 186 MDD patients who received treatment with duloxetine for up to 8 weeks were categorized as "responders" based on a MADRS change >50% from baseline, or "remitters" based on a MADRS score ≤10 at end point. The initial dataset (N = 186) was randomly divided into training and test sets in a nested 5-fold cross-validation, where 80% was used as a training set and 20% made up five independent test sets. We performed genome-wide logistic regression to identify potentially significant variants related to duloxetine response/remission and extracted the most promising predictors using LASSO regression. Subsequently, classification-regression trees (CRT) and support vector machines (SVM) were applied to construct models, using ten-fold cross-validation. With regard to response, none of the pairs performed significantly better than chance (accuracy p > .1). For remission, SVM achieved moderate performance (accuracy = 0.52, sensitivity = 0.58, specificity = 0.46), whereas CRT scored 0.51 on all three measures. The best performing SVM fold was characterized by an accuracy = 0.66 (p = .071), a sensitivity = 0.70 and a specificity = 0.61. In this study, the potential of using GWAS data to predict duloxetine outcomes was examined using ML models. The models were characterized by a promising sensitivity, but specificity remained moderate at best. The inclusion of additional non-genetic variables to create integrated models may improve prediction.
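The outcome definitions and performance measures in this abstract can be made concrete with a small sketch. The thresholds (a MADRS reduction of more than 50% from baseline for response, an end-point score of at most 10 for remission) come from the abstract; the function names, the example scores, and the toy label vectors are hypothetical, and a positive baseline score is assumed.

```python
from typing import Dict, Sequence


def is_responder(baseline: float, endpoint: float) -> bool:
    """Response: MADRS score reduced by more than 50% from baseline (baseline > 0 assumed)."""
    return (baseline - endpoint) / baseline > 0.5


def is_remitter(endpoint: float) -> bool:
    """Remission: MADRS score of 10 or lower at end point."""
    return endpoint <= 10


def classification_metrics(y_true: Sequence[bool],
                           y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from true and predicted binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of true remitters that were predicted
        "specificity": tn / (tn + fp),   # share of non-remitters that were predicted
    }


# Toy labels: three true remitters and two non-remitters.
y_true = [True, True, True, False, False]
y_pred = [True, True, False, False, True]
metrics = classification_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters here because a model can reach moderate accuracy while performing near chance on one of the two classes.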
Affiliation(s)
- Malgorzata Maciukiewicz
  - Pharmacogenetic Research Clinic, Center for Addiction and Mental Health, Toronto, ON, Canada
- Victoria S Marshe
  - Pharmacogenetic Research Clinic, Center for Addiction and Mental Health, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Anne-Christin Hauschild
  - IBM Life Sciences Discovery Centre, Princess Margaret Cancer Centre, Toronto, ON, Canada; Department of Computer Science, University of Toronto, Toronto, ON, Canada; University Health Network, Toronto, ON, Canada
- Jane A Foster
  - University Health Network, Toronto, ON, Canada; Department of Psychiatry and Behavioral Neurosciences, McMaster University, Hamilton, ON, Canada
- Susan Rotzinger
  - University Health Network, Toronto, ON, Canada; Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- James L Kennedy
  - Pharmacogenetic Research Clinic, Center for Addiction and Mental Health, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Sidney H Kennedy
  - Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; University Health Network, Toronto, ON, Canada; Department of Psychiatry, University of Toronto, Toronto, ON, Canada; Department of Psychiatry, St. Michael's Hospital, Toronto, ON, Canada
- Daniel J Müller
  - Pharmacogenetic Research Clinic, Center for Addiction and Mental Health, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Department of Psychiatry, University of Toronto, Toronto, ON, Canada.
- Joseph Geraci
  - Department of Molecular Medicine, Queen's University, Kingston, ON, Canada
|
63
|
Sinoquet C. A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies. BMC Bioinformatics 2018; 19:106. [PMID: 29587628 PMCID: PMC5870262 DOI: 10.1186/s12859-018-2054-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/10/2016] [Accepted: 02/09/2018] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND: Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of complex phenotypes. However, standard single-SNP GWASs suffer from lack of power. In particular, they do not directly account for linkage disequilibrium, that is the dependences between SNPs (Single Nucleotide Polymorphisms).
RESULTS: We present the comparative study of two multilocus GWAS strategies, in the random forest-based framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379, 2014). We designed the other method, which is an innovative hybrid method combining T-Trees with the modeling of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks with latent variables, following our former works (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the two methods, both on simulated and real data. For dominant and additive genetic models, in either of the conditions simulated, the hybrid approach always slightly performs better than T-Trees. We assessed predictive powers through the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predicted power observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of SNPs' scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the intersections of top 100 SNPs output by any two or the three methods amongst T-Trees, the hybrid approach, and the single-SNP method.
CONCLUSIONS: The sophistication of T-Trees through finer linkage disequilibrium modeling is shown beneficial. The distributions of SNPs' scores generated by T-Trees and the hybrid approach are shown statistically different, which suggests complementary of the methods. In particular, for 12 of the 14 real datasets, the distribution tail of highest SNPs' scores shows larger values for the hybrid approach. Thus are pinpointed more interesting SNPs than by T-Trees, to be provided as a short list of prioritized SNPs, for a further analysis by biologists. Finally, among the 211 top 100 SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified 72 and 38 SNPs respectively present in the top25s and top10s for each method.
Affiliation(s)
- Christine Sinoquet
- LS2N, UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, BP 92208, Nantes Cedex, 44322, France.
|
64
|
López B, Torrent-Fontbona F, Viñas R, Fernández-Real JM. Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction. Artif Intell Med 2017; 85:43-49. [PMID: 28943335 DOI: 10.1016/j.artmed.2017.09.005] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 02/10/2017] [Revised: 09/04/2017] [Indexed: 10/18/2022]
Abstract
OBJECTIVE The use of artificial intelligence techniques to find out which Single Nucleotide Polymorphisms (SNPs) promote the development of a disease is one of the features of medical research, as such techniques may potentially aid early diagnosis and help in the prescription of preventive measures. In particular, the aim is to help physicians to identify the relevant SNPs related to Type 2 diabetes, and to build a decision-support tool for risk prediction. METHODS We use the Random Forest (RF) technique in order to search for the most important attributes (SNPs) related to diabetes, giving a weight (degree of importance), ranging between 0 and 1, to each attribute. Support Vector Machines and Logistic Regression have also been used since they are two other machine learning techniques that are well-established in the health community. Their performance has been compared to that achieved by RF. Furthermore, the relevance of the attributes obtained through the use of RF has then been used to perform predictions with k-Nearest Neighbour method weighting attributes in the similarity measure according to the relevance of the attributes with RF. RESULTS Testing is performed on a set of 677 subjects. RF is able to handle the complexity of features' interactions, overfitting, and unknown attribute values, providing the SNPs' relevance with an up to 0.89 area under the ROC curve in terms of risk prediction. RF outperforms all the other tested machine learning techniques in terms of prediction accuracy, and in terms of the stability of the estimated relevance of the attributes. CONCLUSIONS The Random Forest is a useful method for learning predictive models and the relevance of SNPs without any underlying assumption.
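The attribute-weighted k-Nearest Neighbour step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mismatch-based distance, the function names, and the toy genotypes, labels and relevance weights are all hypothetical, and the weights are simply given rather than learned by a Random Forest.

```python
from collections import Counter

def weighted_distance(a, b, weights):
    """Sum of relevance weights over the attributes where two samples differ.

    Assumption: a simple weighted mismatch count; the abstract does not
    specify the similarity measure.
    """
    return sum(w * (x != y) for x, y, w in zip(a, b, weights))

def knn_predict(query, samples, labels, weights, k=3):
    """Majority vote among the k samples closest to `query`."""
    ranked = sorted(
        range(len(samples)),
        key=lambda i: weighted_distance(query, samples[i], weights),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: genotypes coded 0/1/2 at three SNPs,
# label 1 = case, 0 = control, weights in [0, 1] as in the abstract.
samples = [(0, 0, 0), (0, 1, 2), (1, 0, 0), (2, 1, 1), (2, 0, 2)]
labels = [0, 0, 1, 1, 1]
weights = [0.9, 0.1, 0.0]

prediction = knn_predict((0, 1, 1), samples, labels, weights, k=3)
```

With these toy weights the first SNP dominates the distance, so neighbours are chosen mostly by agreement at that SNP; the third SNP, with weight 0, is ignored entirely.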
Affiliation(s)
- Beatriz López
- University of Girona, Campus Montilivi, building EPS4, 17071 Girona, Spain.
| | | | - Ramón Viñas
- University of Girona, Campus Montilivi, building EPS4, 17071 Girona, Spain.
| | - José Manuel Fernández-Real
- Biomedical Research Institute of Girona, Avda. de França, s/n, 17007 Girona, Spain; CIBERobn Pathophysiology of Obesity and Nutrition, Instituto de Salud Carlos III, Madrid, Spain.
|
65
|
Naulaerts S, Dang CC, Ballester PJ. Precision and recall oncology: combining multiple gene mutations for improved identification of drug-sensitive tumours. Oncotarget 2017; 8:97025-97040. [PMID: 29228590 PMCID: PMC5722542 DOI: 10.18632/oncotarget.20923] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/16/2017] [Accepted: 08/14/2017] [Indexed: 02/07/2023]
Abstract
Cancer drug therapies are only effective in a small proportion of patients. To make things worse, our ability to identify these responsive patients before administering a treatment is generally very limited. The recent arrival of large-scale pharmacogenomic data sets, which measure the sensitivity of molecularly profiled cancer cell lines to a panel of drugs, has boosted research on the discovery of drug sensitivity markers. However, no systematic comparison of widely-used single-gene markers with multi-gene machine-learning markers exploiting genomic data has been so far conducted. We therefore assessed the performance offered by these two types of models in discriminating between sensitive and resistant cell lines to a given drug. This was carried out for each of 127 considered drugs using genomic data characterising the cell lines. We found that the proportion of cell lines predicted to be sensitive that are actually sensitive (precision) varies strongly with the drug and type of model used. Furthermore, the proportion of sensitive cell lines that are correctly predicted as sensitive (recall) of the best single-gene marker was lower than that of the multi-gene marker in 118 of the 127 tested drugs. We conclude that single-gene markers are only able to identify those drug-sensitive cell lines with the considered actionable mutation, unlike multi-gene markers that can in principle combine multiple gene mutations to identify additional sensitive cell lines. We also found that cell line sensitivities to some drugs (e.g. Temsirolimus, 17-AAG or Methotrexate) are better predicted by these machine-learning models.
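The two metrics defined in this abstract, precision (the proportion of cell lines predicted sensitive that are actually sensitive) and recall (the proportion of sensitive cell lines correctly predicted as sensitive), can be computed as in the sketch below. The function name and the toy predictions are hypothetical and are not data from the study.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for boolean calls, True = sensitive."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # predicted sensitive, is sensitive
    fp = sum(p and not a for p, a in pairs)    # predicted sensitive, is resistant
    fn = sum(a and not p for p, a in pairs)    # missed sensitive cell line
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy example with six cell lines.
actual = [True, True, True, True, False, False]
single_gene = [True, False, False, False, False, False]  # flags one mutated line only
multi_gene = [True, True, True, False, True, False]      # flags more lines, one wrongly

single_result = precision_recall(single_gene, actual)
multi_result = precision_recall(multi_gene, actual)
```

In this toy case the single-gene marker has perfect precision but low recall, while the multi-gene marker trades some precision for higher recall, which is the kind of contrast the abstract describes.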
Affiliation(s)
- Stefan Naulaerts
- Computational Biology and Drug Design, Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; CNRS UMR7258, Marseille, France
| | - Cuong C Dang
- Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
| | - Pedro J Ballester
- Computational Biology and Drug Design, Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; CNRS UMR7258, Marseille, France
|
66
|
Coram MA, Fang H, Candille SI, Assimes TL, Tang H. Leveraging Multi-ethnic Evidence for Risk Assessment of Quantitative Traits in Minority Populations. Am J Hum Genet 2017; 101:218-226. [PMID: 28757202 DOI: 10.1016/j.ajhg.2017.06.015] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 02/08/2017] [Accepted: 06/29/2017] [Indexed: 11/20/2022]
Abstract
An essential component of precision medicine is the ability to predict an individual's risk of disease based on genetic and non-genetic factors. For complex traits and diseases, assessing the risk due to genetic factors is challenging because it requires knowledge of both the identity of variants that influence the trait and their corresponding allelic effects. Although the set of risk variants and their allelic effects may vary between populations, a large proportion of these variants were identified based on studies in populations of European descent. Heterogeneity in genetic architecture underlying complex traits and diseases, while broadly acknowledged, remains poorly characterized. Ignoring such heterogeneity likely reduces predictive accuracy for minority individuals. In this study, we propose an approach, called XP-BLUP, which ameliorates this ethnic disparity by combining trans-ethnic and ethnic-specific information. We build a polygenic model for complex traits that distinguishes candidate trait-relevant variants from the rest of the genome. The set of candidate variants are selected based on studies in any human population, yet the allelic effects are evaluated in a population-specific fashion. Simulation studies and real data analyses demonstrate that XP-BLUP adaptively utilizes trans-ethnic information and can substantially improve predictive accuracy in minority populations. At the same time, our study highlights the importance of the continued expansion of minority cohorts.
Affiliation(s)
- Marc A Coram
- Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Huaying Fang
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Sophie I Candille
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Themistocles L Assimes
- Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA; Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Hua Tang
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA.
|
67
|
Almlöf JC, Alexsson A, Imgenberg-Kreuz J, Sylwan L, Bäcklin C, Leonard D, Nordmark G, Tandre K, Eloranta ML, Padyukov L, Bengtsson C, Jönsen A, Dahlqvist SR, Sjöwall C, Bengtsson AA, Gunnarsson I, Svenungsson E, Rönnblom L, Sandling JK, Syvänen AC. Novel risk genes for systemic lupus erythematosus predicted by random forest classification. Sci Rep 2017; 7:6236. [PMID: 28740209 PMCID: PMC5524838 DOI: 10.1038/s41598-017-06516-1] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 02/13/2017] [Accepted: 06/13/2017] [Indexed: 01/08/2023]
Abstract
Genome-wide association studies have identified risk loci for SLE, but a large proportion of the genetic contribution to SLE still remains unexplained. To detect novel risk genes, and to predict an individual's SLE risk we designed a random forest classifier using SNP genotype data generated on the "Immunochip" from 1,160 patients with SLE and 2,711 controls. Using gene importance scores defined by the random forest classifier, we identified 15 potential novel risk genes for SLE. Of them 12 are associated with other autoimmune diseases than SLE, whereas three genes (ZNF804A, CDK1, and MANF) have not previously been associated with autoimmunity. Random forest classification also allowed prediction of patients at risk for lupus nephritis with an area under the curve of 0.94. By allele-specific gene expression analysis we detected cis-regulatory SNPs that affect the expression levels of six of the top 40 genes designed by the random forest analysis, indicating a regulatory role for the identified risk variants. The 40 top genes from the prediction were overrepresented for differential expression in B and T cells according to RNA-sequencing of samples from five healthy donors, with more frequent over-expression in B cells compared to T cells.
Affiliation(s)
- Jonas Carlsson Almlöf
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
| | - Andrei Alexsson
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Juliana Imgenberg-Kreuz
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Lina Sylwan
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Science for Life Laboratory (SciLifeLab), Department of Biosciences and Nutrition, Karolinska Institutet, Solna, Sweden
| | - Christofer Bäcklin
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Dag Leonard
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Gunnel Nordmark
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Karolina Tandre
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Maija-Leena Eloranta
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Leonid Padyukov
- Rheumatology Unit, Department of Medicine, Karolinska Institutet, Karolinska university hospital, Stockholm, Sweden
| | - Christine Bengtsson
- Department of Public Health and Clinical Medicine/Rheumatology, Umeå University, Umeå, Sweden
| | - Andreas Jönsen
- Lund University, Skåne University Hospital, Department of Clinical Sciences, Rheumatology, Lund, Sweden
| | | | - Christopher Sjöwall
- AIR/Rheumatology, Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
| | - Anders A Bengtsson
- Lund University, Skåne University Hospital, Department of Clinical Sciences, Rheumatology, Lund, Sweden
| | - Iva Gunnarsson
- Rheumatology Unit, Department of Medicine, Karolinska Institutet, Karolinska university hospital, Stockholm, Sweden
| | - Elisabet Svenungsson
- Rheumatology Unit, Department of Medicine, Karolinska Institutet, Karolinska university hospital, Stockholm, Sweden
| | - Lars Rönnblom
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Johanna K Sandling
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Department of Medical Sciences, Rheumatology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Ann-Christine Syvänen
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
|
68
|
Xavier A, Xu S, Muir W, Rainey KM. Genomic prediction using subsampling. BMC Bioinformatics 2017; 18:191. [PMID: 28340551 PMCID: PMC5366167 DOI: 10.1186/s12859-017-1582-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 11/15/2016] [Accepted: 03/03/2017] [Indexed: 01/09/2023]
Abstract
Background: Genome-wide assisted selection is a critical tool for the genetic improvement of plants and animals. Whole-genome regression models in Bayesian framework represent the main family of prediction methods. Fitting such models with a large number of observations involves a prohibitive computational burden. We propose the use of subsampling bootstrap Markov chain in genomic prediction. Such method consists of fitting whole-genome regression models by subsampling observations in each round of a Markov Chain Monte Carlo. We evaluated the effect of subsampling bootstrap on prediction and computational parameters.
Results: Across datasets, we observed an optimal subsampling proportion of observations around 50% with replacement, and around 33% without replacement. Subsampling provided a substantial decrease in computation time, reducing the time to fit the model by half. On average, losses on predictive properties imposed by subsampling were negligible, usually below 1%. For each dataset, an optimal subsampling point that improves prediction properties was observed, but the improvements were also negligible.
Conclusion: Combining subsampling with Gibbs sampling is an interesting ensemble algorithm. The investigation indicates that the subsampling bootstrap Markov chain algorithm substantially reduces computational burden associated with model fitting, and it may slightly enhance prediction properties.
Affiliation(s)
- Alencar Xavier
- Department of Agronomy, Purdue University, 915 W. State St., Lilly Hall, West Lafayette, IN, 47907, USA
| | - Shizhong Xu
- Department of Plant Science, University of California, 3134 Batchelor Hall, Riverside, CA, 92521, USA
| | - William Muir
- Department of Animal Science, Purdue University, 915 W. State St., Lilly Hall, West Lafayette, IN, 47907, USA
| | - Katy Martin Rainey
- Department of Agronomy, Purdue University, 915 W. State St., Lilly Hall, West Lafayette, IN, 47907, USA.
|
69
|
Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Sci Rep 2017; 7:41262. [PMID: 28145530 PMCID: PMC5286518 DOI: 10.1038/srep41262] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 06/15/2016] [Accepted: 12/20/2016] [Indexed: 11/24/2022]
Abstract
Polygenic risk scores (PRS) from genome-wide association studies (GWAS) are increasingly used to predict disease risks. However some included variants could be false positives and the raw estimates of effect sizes from them may be subject to selection bias. In addition, the standard PRS approach requires testing over a range of p-value thresholds, which are often chosen arbitrarily. The prediction error estimated from the optimized threshold may also be subject to an optimistic bias. To improve genomic risk prediction, we proposed new empirical Bayes approaches to recover the underlying effect sizes and used them as weights to construct PRS. We applied the new PRS to twelve cardio-metabolic traits in the Northern Finland Birth Cohort and demonstrated improvements in predictive power (in R2) when compared to standard PRS at the best p-value threshold. Importantly, for eleven out of the twelve traits studied, the predictive performance from the entire set of genome-wide markers outperformed the best R2 from standard PRS at optimal p-value thresholds. Our proposed methodology essentially enables an automatic PRS weighting scheme without the need of choosing tuning parameters. The new method also performed satisfactorily in simulations. It is computationally simple and does not require assumptions on the effect size distributions.
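For orientation, the standard PRS that this abstract uses as its baseline can be sketched as a weighted sum of allele dosages over the variants passing a p-value threshold. The sketch below shows only that baseline, not the empirical Bayes weighting the authors propose; the function name and the toy dosages, effect sizes and p-values are hypothetical.

```python
def polygenic_risk_score(dosages, effects, p_values, p_threshold):
    """Standard thresholded PRS for one individual.

    dosages:  risk-allele counts (0, 1 or 2) per variant
    effects:  raw GWAS effect-size estimates used as weights
    p_values: GWAS association p-values per variant
    Only variants with p <= p_threshold contribute to the score.
    """
    return sum(
        d * b
        for d, b, p in zip(dosages, effects, p_values)
        if p <= p_threshold
    )

# Hypothetical toy data for four variants.
dosages = [0, 1, 2, 1]
effects = [0.5, 0.25, -0.5, 1.0]
p_values = [1e-8, 0.001, 0.04, 0.5]

# The score changes with the chosen threshold, which is the tuning
# step the abstract describes as often arbitrary.
scores = {t: polygenic_risk_score(dosages, effects, p_values, t)
          for t in (1e-5, 0.05, 1.0)}
```

Sweeping the threshold and keeping the best-performing score is what introduces the optimistic bias mentioned in the abstract, since the same data are used both to choose the threshold and to report performance.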
|
70
|
Wang MH, Weng H. Genetic Test, Risk Prediction, and Counseling. Advances in Experimental Medicine and Biology 2017; 1005:21-46. [DOI: 10.1007/978-981-10-5717-5_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/31/2022]
|
71
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 PMCID: PMC5310525 DOI: 10.12688/f1000research.10529.2] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Accepted: 03/10/2017] [Indexed: 12/19/2022]
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data.
Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation.
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
Affiliation(s)
- Linh Nguyen
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Cuong C Dang
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
|
72
|
Nguyen L, Dang CC, Ballester PJ. Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016; 5. [PMID: 28299173 DOI: 10.12688/f1000research.10529.1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Accepted: 12/28/2016] [Indexed: 12/30/2022]
Abstract
Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data.
Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation.
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
Affiliation(s)
- Linh Nguyen
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Cuong C Dang
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
| | - Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France; Institut Paoli-Calmettes, Marseille, France; Aix-Marseille Université, Marseille, France; Cancer Research Center of Marseille UMR7258, Marseille, France
|
73
|
Nishizaki SS, Boyle AP. Mining the Unknown: Assigning Function to Noncoding Single Nucleotide Polymorphisms. Trends Genet 2016; 33:34-45. [PMID: 27939749 DOI: 10.1016/j.tig.2016.10.008] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 09/12/2016] [Revised: 10/30/2016] [Accepted: 10/31/2016] [Indexed: 11/18/2022]
Abstract
One of the formative goals of genetics research is to understand how genetic variation leads to phenotypic differences and human disease. Genome-wide association studies (GWASs) bring us closer to this goal by linking variation with disease faster than ever before. Despite this, GWASs alone are unable to pinpoint disease-causing single nucleotide polymorphisms (SNPs). Noncoding SNPs, which represent the majority of GWAS SNPs, present a particular challenge. To address this challenge, an array of computational tools designed to prioritize and predict the function of noncoding GWAS SNPs have been developed. However, fewer than 40% of GWAS publications from 2015 utilized these tools. We discuss several leading methods for annotating noncoding variants and how they can be integrated into research pipelines in hopes that they will be broadly applied in future GWAS analyses.
Affiliation(s)
- Sierra S Nishizaki
- Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Alan P Boyle
- Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
|
74
|
Mostafa Abd El Hamid M, Omar YM, Mabrouk MS. Identifying genetic biomarkers associated to Alzheimer's disease using Support Vector Machine. 2016 8th Cairo International Biomedical Engineering Conference (CIBEC) 2016. [DOI: 10.1109/cibec.2016.7836087] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 09/01/2023]
|
75
|
Mieth B, Kloft M, Rodríguez JA, Sonnenburg S, Vobruba R, Morcillo-Suárez C, Farré X, Marigorta UM, Fehr E, Dickhaus T, Blanchard G, Schunk D, Navarro A, Müller KR. Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies. Sci Rep 2016; 6:36671. [PMID: 27892471 PMCID: PMC5125008 DOI: 10.1038/srep36671] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 05/31/2016] [Accepted: 10/06/2016] [Indexed: 12/21/2022]
Abstract
The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation in a mathematically well-controlled manner into account. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
Affiliation(s)
- Bettina Mieth
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
| | - Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, 10099, Germany
| | - Juan Antonio Rodríguez
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
| | | | - Robin Vobruba
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
| | - Carlos Morcillo-Suárez
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
| | - Xavier Farré
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
| | - Urko M. Marigorta
- School of Biology, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Ernst Fehr
- Department of Economics, Laboratory for Social and Neural Systems Research, University of Zurich, Zurich, 8006, Switzerland
| | - Thorsten Dickhaus
- Institute for Statistics (FB 3), University of Bremen, Bremen, 28359, Germany
| | - Gilles Blanchard
- Department of Mathematics, University of Potsdam, Potsdam, 14476, Germany
| | - Daniel Schunk
- Department of Economics, University of Mainz, Mainz, 55099, Germany
| | - Arcadi Navarro
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, 08010, Spain
- Center for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, 08003, Spain
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea
| |
Collapse
|
76
|
Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data. Lancet Oncol 2016; 18:132-142. [PMID: 27864015 DOI: 10.1016/s1470-2045(16)30560-5] [Citation(s) in RCA: 100] [Impact Index Per Article: 12.5] [Received: 07/04/2016] [Revised: 09/12/2016] [Accepted: 09/21/2016] [Indexed: 01/10/2023]
Abstract
BACKGROUND Improvements to prognostic models in metastatic castration-resistant prostate cancer have the potential to augment clinical trial design and guide treatment strategies. In partnership with Project Data Sphere, a not-for-profit initiative allowing data from cancer clinical trials to be shared broadly with researchers, we designed an open-data, crowdsourced, DREAM (Dialogue for Reverse Engineering Assessments and Methods) challenge to not only identify a better prognostic model for prediction of survival in patients with metastatic castration-resistant prostate cancer but also engage a community of international data scientists to study this disease. METHODS Data from the comparator arms of four phase 3 clinical trials in first-line metastatic castration-resistant prostate cancer were obtained from Project Data Sphere, comprising 476 patients treated with docetaxel and prednisone from the ASCENT2 trial, 526 patients treated with docetaxel, prednisone, and placebo in the MAINSAIL trial, 598 patients treated with docetaxel, prednisone or prednisolone, and placebo in the VENICE trial, and 470 patients treated with docetaxel and placebo in the ENTHUSE 33 trial. Datasets consisting of more than 150 clinical variables were curated centrally, including demographics, laboratory values, medical history, lesion sites, and previous treatments. Data from ASCENT2, MAINSAIL, and VENICE were released publicly to be used as training data to predict the outcome of interest, namely overall survival. Clinical data were also released for ENTHUSE 33, but data for outcome variables (overall survival and event status) were hidden from the challenge participants so that ENTHUSE 33 could be used for independent validation. Methods were evaluated using the integrated time-dependent area under the curve (iAUC). The reference model, based on eight clinical variables and a penalised Cox proportional-hazards model, was used to compare method performance. Further validation was done using data from a fifth trial, ENTHUSE M1, in which 266 patients with metastatic castration-resistant prostate cancer were treated with placebo alone. FINDINGS 50 independent methods were developed to predict overall survival and were evaluated through the DREAM challenge. The top performer was based on an ensemble of penalised Cox regression models (ePCR), which uniquely identified predictive interaction effects with immune biomarkers and markers of hepatic and renal function. Overall, ePCR outperformed all other methods (iAUC 0·791; Bayes factor >5) and surpassed the reference model (iAUC 0·743; Bayes factor >20). Both the ePCR model and reference models stratified patients in the ENTHUSE 33 trial into high-risk and low-risk groups with significantly different overall survival (ePCR: hazard ratio 3·32, 95% CI 2·39-4·62, p<0·0001; reference model: 2·56, 1·85-3·53, p<0·0001). The new model was validated further on the ENTHUSE M1 cohort with similarly high performance (iAUC 0·768). Meta-analysis across all methods confirmed previously identified predictive clinical variables and revealed aspartate aminotransferase as an important, albeit previously under-reported, prognostic biomarker. INTERPRETATION Novel prognostic factors were delineated, and the assessment of 50 methods developed by independent international teams establishes a benchmark for development of methods in the future. The results of this effort show that data-sharing, when combined with a crowdsourced challenge, is a robust and powerful framework to develop new prognostic models in advanced prostate cancer. FUNDING Sanofi US Services, Project Data Sphere.
|
77
|
Zhao SD. Integrative genetic risk prediction using non-parametric empirical Bayes classification. Biometrics 2016; 73:582-592. [PMID: 27792843 DOI: 10.1111/biom.12619] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/01/2016] [Revised: 09/01/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022]
Abstract
Genetic risk prediction is an important component of individualized medicine, but prediction accuracies remain low for many complex diseases. A fundamental limitation is the sample sizes of the studies on which the prediction algorithms are trained. One way to increase the effective sample size is to integrate information from previously existing studies. However, it can be difficult to find existing data that examine the target disease of interest, especially if that disease is rare or poorly studied. Furthermore, individual-level genotype data from these auxiliary studies are typically difficult to obtain. This article proposes a new approach to integrative genetic risk prediction of complex diseases with binary phenotypes. It accommodates possible heterogeneity in the genetic etiologies of the target and auxiliary diseases using a tuning parameter-free non-parametric empirical Bayes procedure, and can be trained using only auxiliary summary statistics. Simulation studies show that the proposed method can provide superior predictive accuracy relative to non-integrative as well as integrative classifiers. The method is applied to a recent study of pediatric autoimmune diseases, where it substantially reduces prediction error for certain target/auxiliary disease combinations. The proposed method is implemented in the R package ssa.
Affiliation(s)
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, U.S.A
|
78
|
Crowdsourced assessment of common genetic contribution to predicting anti-TNF treatment response in rheumatoid arthritis. Nat Commun 2016; 7:12460. [PMID: 27549343 PMCID: PMC4996969 DOI: 10.1038/ncomms12460] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 10/28/2015] [Accepted: 07/05/2016] [Indexed: 12/17/2022] Open
Abstract
Rheumatoid arthritis (RA) affects millions world-wide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in about one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment efficacy in RA patients was performed in the context of a DREAM Challenge (http://www.synapse.org/RA_Challenge). An open challenge framework enabled the comparative evaluation of predictions developed by 73 research groups using the most comprehensive available data and covering a wide range of state-of-the-art modelling methodologies. Despite a significant genetic heritability estimate of the treatment non-response trait (h²=0.18, P value=0.02), no significant genetic contribution to prediction accuracy is observed. Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data. Rheumatoid arthritis patients respond differently to anti-TNF treatment. Using a community-based challenge, the authors show that currently available data do not reveal meaningful genetic predictors of response to anti-TNF therapy, thus confirming clinical observations.
|
79
|
Xavier A, Muir WM, Rainey KM. Assessing Predictive Properties of Genome-Wide Selection in Soybeans. G3 (Bethesda) 2016; 6:2611-6. [PMID: 27317786 PMCID: PMC4978914 DOI: 10.1534/g3.116.032268] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 03/04/2016] [Accepted: 06/16/2016] [Indexed: 11/30/2022]
Abstract
Many economically important traits in plant breeding have low heritability or are difficult to measure. For these traits, genomic selection has attractive features and may boost genetic gains. Our goal was to evaluate alternative scenarios to implement genomic selection for yield components in soybean (Glycine max (L.) Merr.). We used a nested association panel with cross validation to evaluate the impacts of training population size, genotyping density, and prediction model on the accuracy of genomic prediction. Our results indicate that training population size was the factor most relevant to improvement in genome-wide prediction, with greatest improvement observed in training sets up to 2000 individuals. We discuss assumptions that influence the choice of the prediction model. Although alternative models had minor impacts on prediction accuracy, the most robust prediction model was the combination of reproducing kernel Hilbert space regression and BayesB. Higher genotyping density marginally improved accuracy. Our study finds that breeding programs seeking efficient genomic selection in soybeans would best allocate resources by investing in a representative training set.
Affiliation(s)
- Alencar Xavier
  - Department of Agronomy, Purdue University, West Lafayette, Indiana 47907
- William M Muir
  - Department of Animal Science, Purdue University, West Lafayette, Indiana 47907
- Katy Martin Rainey
  - Department of Agronomy, Purdue University, West Lafayette, Indiana 47907
|
80
|
Laurin C, Boomsma D, Lubke G. The use of vector bootstrapping to improve variable selection precision in Lasso models. Stat Appl Genet Mol Biol 2016; 15:305-20. [PMID: 27248122 PMCID: PMC5131926 DOI: 10.1515/sagmb-2015-0043] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/15/2022]
Abstract
The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model as well as under a model with effect sizes representative of typical GWAS results. We compared these approaches to each other as well as to software defaults for the Lasso. Nested cross-validation had the most precise variable selection at small effect sizes. At larger effect sizes, there was no advantage to nesting. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, we found that the default Lasso selected low-reliability SNPs and interactions which were excluded by bootstrapping.
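The idea of resampling whole observation vectors and keeping only predictors that are selected consistently can be illustrated with a minimal sketch. It is not the authors' procedure: the selector below is a simple correlation threshold standing in for a cross-validated Lasso, and the function names, threshold, and toy data are all assumptions made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def select(rows, threshold=0.5):
    """Stand-in selector (NOT the Lasso): keep predictors with |r| > threshold."""
    ys = [r[-1] for r in rows]
    n_pred = len(rows[0]) - 1
    return {j for j in range(n_pred)
            if abs(pearson([r[j] for r in rows], ys)) > threshold}

def selection_frequency(rows, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each predictor is selected."""
    rng = random.Random(seed)
    n_pred = len(rows[0]) - 1
    counts = [0] * n_pred
    for _ in range(n_boot):
        # Vector bootstrap: resample complete rows (predictors + outcome) with replacement.
        sample = [rng.choice(rows) for _ in rows]
        for j in select(sample):
            counts[j] += 1
    return [c / n_boot for c in counts]

# Hypothetical toy data: each row is (predictor 0, predictor 1, outcome).
# Predictor 0 tracks the outcome exactly; predictor 1 is essentially unrelated.
rows = [(i % 5, i % 3, i % 5) for i in range(40)]
freq = selection_frequency(rows)
```

Predictors with a selection frequency near 1 are stable across resamples; those selected only occasionally are the low-reliability selections that a bootstrap step is meant to filter out.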
|
81
|
Mikhchi A, Honarvar M, Kashan NEJ, Aminafshar M. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation. J Theor Biol 2016; 399:148-58. [PMID: 27049046 DOI: 10.1016/j.jtbi.2016.03.035] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/03/2015] [Revised: 03/06/2016] [Accepted: 03/24/2016] [Indexed: 11/17/2022]
Abstract
Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available; they can either employ universal machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine (ELM), Radial Basis Function (RBF), Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time, and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. Random Forest and Support Vector Machine were more accurate than the other machine learning methods. TotalBoost performed slightly worse than the other methods. Running times differed between methods. The ELM was always the fastest algorithm. As the sample size increases, the RBF requires a long imputation time. The tested methods can be an alternative for imputation of un-typed SNPs when the missing rate of the data is low. However, it is recommended that other machine learning methods be used for imputation.
Affiliation(s)
- Abbas Mikhchi
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mahmood Honarvar
  - Department of Animal Science, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran
- Nasser Emam Jomeh Kashan
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mehdi Aminafshar
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
|
82
|
Waldmann P. Genome-wide prediction using Bayesian additive regression trees. Genet Sel Evol 2016; 48:42. [PMID: 27286957 PMCID: PMC4901500 DOI: 10.1186/s12711-016-0219-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 03/04/2016] [Accepted: 05/26/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. The major problem with GWP is high-dimensional data from many thousands of SNPs scored on several thousands of individuals. A large number of methods have been developed for GWP, which are mostly parametric methods that assume statistical linearity and only additive genetic effects. The Bayesian additive regression trees (BART) method was recently proposed and is based on the sum of nonparametric regression trees with the priors being used to regularize the parameters. Each regression tree is based on a recursive binary partitioning of the predictor space that approximates an unknown function, which will automatically model nonlinearities within SNPs (dominance) and interactions between SNPs (epistasis). In this study, we introduced BART and compared its predictive performance with that of the LASSO, Bayesian LASSO (BLASSO), genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) regression and random forest (RF) methods. RESULTS Tests on the QTLMAS2010 simulated data, which are mainly based on additive genetic effects, show that cross-validated optimization of BART provides a smaller prediction error than the RF, BLASSO, GBLUP and RKHS methods, and is almost as accurate as the LASSO method. If dominance and epistasis effects are added to the QTLMAS2010 data, the accuracy of BART relative to the other methods was increased. We also showed that BART can produce importance measures on the SNPs through variable inclusion proportions. In evaluations using real data on pigs, the prediction error was smaller with BART than with the other methods. CONCLUSIONS BART was shown to be an accurate method for GWP, in which the regression trees guarantee a very sparse representation of additive and complex non-additive genetic effects. 
Moreover, the Markov chain Monte Carlo algorithm with Bayesian back-fitting provides a computationally efficient procedure that is suitable for high-dimensional genomic data.
Affiliation(s)
- Patrik Waldmann
  - Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences (SLU), Box 7023, 750 07, Uppsala, Sweden
|
83
|
Malovini A, Bellazzi R, Napolitano C, Guffanti G. Multivariate Methods for Genetic Variants Selection and Risk Prediction in Cardiovascular Diseases. Front Cardiovasc Med 2016; 3:17. [PMID: 27376073 PMCID: PMC4896915 DOI: 10.3389/fcvm.2016.00017] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 03/12/2016] [Accepted: 05/23/2016] [Indexed: 01/06/2023] Open
Abstract
Over the last decade, high-throughput genotyping and sequencing technologies have contributed to major advancements in genetics research, as these technologies now facilitate affordable mapping of the entire genome for large sets of individuals. Given this, genome-wide association studies are proving to be powerful tools in identifying genetic variants that have the capacity to modify the probability of developing a disease or trait of interest. However, when the study's goal is to evaluate the effect of genetic variants mapping to specific chromosomal regions on a specific phenotype, the candidate loci approach is still preferred. Regardless of which approach is taken, such a large data set calls for the establishment and development of appropriate analytical methods in order to translate such knowledge into biological or clinical findings. Standard univariate tests often fail to identify informative genetic variants, especially when dealing with complex traits, which are more likely to result from a combination of rare and common variants and non-genetic determinants. These limitations can partially be overcome by multivariate methods, which allow for the identification of informative combinations of genetic variants and non-genetic features. Furthermore, such methods can help to generate additive genetic scores and risk stratification algorithms that, once extensively validated in independent cohorts, could serve as useful tools to assist clinicians in decision-making. This review aims to provide readers with an overview of the main multivariate methods for genetic data analysis that could be applied to the analysis of cardiovascular traits.
Affiliation(s)
- Alberto Malovini
  - Laboratory of Informatics and Systems Engineering for Clinical Research, IRCCS Fondazione Salvatore Maugeri, Pavia, Italy
- Riccardo Bellazzi
  - Laboratory of Informatics and Systems Engineering for Clinical Research, IRCCS Fondazione Salvatore Maugeri, Pavia, Italy; Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Carlo Napolitano
  - Molecular Cardiology Laboratories, IRCCS Fondazione Salvatore Maugeri, Pavia, Italy
- Guia Guffanti
  - Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, MA, USA
|
84
|
González-Camacho JM, Crossa J, Pérez-Rodríguez P, Ornella L, Gianola D. Genome-enabled prediction using probabilistic neural network classifiers. BMC Genomics 2016; 17:208. [PMID: 26956885 PMCID: PMC4784384 DOI: 10.1186/s12864-016-2553-1] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 10/03/2015] [Accepted: 02/29/2016] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Multi-layer perceptron (MLP) and radial basis function neural networks (RBFNN) have been shown to be effective in genome-enabled prediction. Here, we evaluated and compared the classification performance of an MLP classifier versus that of a probabilistic neural network (PNN), to predict the probability of membership of one individual in a phenotypic class of interest, using genomic and phenotypic data as input variables. We used 16 maize and 17 wheat genomic and phenotypic datasets with different trait-environment combinations (sample sizes ranged from 290 to 300 individuals) with 1.4 k and 55 k SNP chips. Classifiers were tested using continuous traits that were categorized into three classes (upper, middle and lower) based on the empirical distribution of each trait, constructed on the basis of two percentiles (15-85 % and 30-70 %). We focused on the 15 and 30 % percentiles for the upper and lower classes for selecting the best individuals, as commonly done in genomic selection. Wheat datasets were also used with two classes. The criteria for assessing the predictive accuracy of the two classifiers were the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUCpr). Parameters of both classifiers were estimated by optimizing the AUC for a specific class of interest. RESULTS The AUC and AUCpr criteria provided enough evidence to conclude that PNN was more accurate than MLP for assigning maize and wheat lines to the correct upper, middle or lower class for the complex traits analyzed. Results for the wheat datasets with continuous traits split into two and three classes showed that the performance of PNN with three classes was higher than with two classes when classifying individuals into the upper and lower (15 or 30 %) categories. 
CONCLUSIONS The PNN classifier outperformed the MLP classifier in all 33 (maize and wheat) datasets when using AUC and AUCpr for selecting individuals of a specific class. Use of PNN with Gaussian radial basis functions seems promising in genomic selection for identifying the best individuals. Categorizing continuous traits into three classes generally provided better classification than when using two classes, because classification accuracy improved when classes were balanced.
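The step of turning a continuous trait into upper, middle and lower classes from its empirical distribution can be sketched as follows. This is only an illustration of percentile-based binning under stated assumptions: the function name, the choice of the inclusive quantile method, and the toy trait values are hypothetical and not taken from the paper.

```python
from statistics import quantiles

def three_class_labels(values, lower_pct=15, upper_pct=85):
    """Label each value 'lower', 'middle' or 'upper' using empirical percentiles."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    low_cut, high_cut = cuts[lower_pct - 1], cuts[upper_pct - 1]
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("lower")
        elif v >= high_cut:
            labels.append("upper")
        else:
            labels.append("middle")
    return labels

# Hypothetical trait values 1..100, binned with the 15-85 % percentiles.
trait = list(range(1, 101))
labels = three_class_labels(trait)
```

Calling `three_class_labels(trait, 30, 70)` gives the 30-70 % split mentioned in the abstract; a classifier would then be trained to predict membership of the class of interest.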
Affiliation(s)
- José Crossa
  - Biometrics and Statistics Unit (BSU), International Maize and Wheat Improvement Center (CIMMYT), Apdo Postal 6-641, México DF, 06600 24105, México
- Leonardo Ornella
  - NIDERA SEMILLAS S.A., Ruta 8 Km. 376, 2600, Venado Tuerto, Argentina
- Daniel Gianola
  - Department of Animal Sciences, University of Wisconsin, Madison, 53706, USA
|
85
|
Mikhchi A, Honarvar M, Emam Jomeh Kashan N, Zerehdaran S, Aminafshar M. Comparison of three boosting methods in parent-offspring trios for genotype imputation using simulation study. J Anim Sci Technol 2016; 58:1. [PMID: 26740888 PMCID: PMC4702368 DOI: 10.1186/s40781-015-0081-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/18/2015] [Accepted: 12/28/2015] [Indexed: 11/30/2022]
Abstract
Background Genotype imputation is an important process for predicting unknown genotypes; it uses a reference population with dense genotypes to predict missing genotypes for both human and animal genetic variations at a low cost. Machine learning methods, especially boosting methods, have been used in genetic studies to explore the underlying genetic profile of disease and to build models capable of predicting missing values of a marker. Methods In this study, strategies and factors affecting the imputation accuracy of parent-offspring trios were compared, from a lower-density SNP panel (5 K) to a high-density SNP panel (10 K), using three boosting methods: TotalBoost (TB), LogitBoost (LB) and AdaBoost (AB). The methods were applied to simulated data to impute the un-typed SNPs in parent-offspring trios. Four datasets were simulated: G1 (100 trios with 5 K SNPs), G2 (100 trios with 10 K SNPs), G3 (500 trios with 5 K SNPs), and G4 (500 trios with 10 K SNPs). In all four datasets, parents were genotyped completely and offspring were genotyped with a lower-density panel. Results Comparison of the three methods showed that LB outperformed AB and TB in imputation accuracy. Computation time differed between methods, and AB was the fastest algorithm. Higher SNP densities increased imputation accuracy. A larger number of trios (i.e. 500) improved the performance of LB and TB. Conclusions The three methods perform well in terms of imputation accuracy, and the dense chip is recommended for imputation of parent-offspring trios.
Affiliation(s)
- Abbas Mikhchi
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mahmood Honarvar
  - Department of Animal Science, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran
- Nasser Emam Jomeh Kashan
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Saeed Zerehdaran
  - Department of Animal Science, Ferdowsi University of Mashhad, Mashhad, Iran
- Mehdi Aminafshar
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
|
86
|
Mas S, Gassó P, Lafuente A. Applicability of gene expression and systems biology to develop pharmacogenetic predictors; antipsychotic-induced extrapyramidal symptoms as an example. Pharmacogenomics 2015; 16:1975-88. [PMID: 26556470 DOI: 10.2217/pgs.15.134] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022] Open
Abstract
Pharmacogenetics has been driven by a candidate gene approach. The disadvantage of this approach is that it is limited by our current understanding of the mechanisms by which drugs act. Gene expression could help to elucidate the molecular signatures of antipsychotic treatments by searching for dysregulated molecular pathways and the relationships between gene products, especially protein-protein interactions. To embrace the complexity of drug response, machine learning methods could help to identify gene-gene interactions and develop pharmacogenetic predictors of drug response. The present review summarizes the applicability of these topics (gene expression, network analysis and gene-gene interactions) in pharmacogenetics. To this end, we present an example of identifying genetic predictors of extrapyramidal symptoms induced by antipsychotics.
Affiliation(s)
- Sergi Mas
  - Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Spain
- Patricia Gassó
  - Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
- Amelia Lafuente
  - Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Spain
|
87
|
Abraham G, Rohmer A, Tye-Din JA, Inouye M. Genomic prediction of celiac disease targeting HLA-positive individuals. Genome Med 2015; 7:72. [PMID: 26244058 PMCID: PMC4523954 DOI: 10.1186/s13073-015-0196-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 04/04/2015] [Accepted: 07/08/2015] [Indexed: 02/07/2023] Open
Abstract
Background: Genomic prediction aims to leverage genome-wide genetic data towards better disease diagnostics and risk scores. We have previously published a genomic risk score (GRS) for celiac disease (CD), a common and highly heritable autoimmune disease, which differentiates between CD cases and population-based controls at a clinically-relevant predictive level, improving upon other gene-based approaches. HLA risk haplotypes, particularly HLA-DQ2.5, are necessary but not sufficient for CD, with at least one HLA risk haplotype present in up to half of most Caucasian populations. Here, we assess a genomic prediction strategy that specifically targets this common genetic susceptibility subtype, utilizing a supervised learning procedure for CD that leverages known HLA-DQ2.5 risk.
Methods: Using L1/L2-regularized support-vector machines trained on large European case-control datasets, we constructed novel CD GRSs specific to individuals with HLA-DQ2.5 risk haplotypes (GRS-DQ2.5) and compared them with the predictive power of the existing CD GRS (GRS14) as well as two haplotype-based approaches, externally validating the results in a North American case-control study.
Results: Consistent with previous observations, both the existing GRS14 and the GRS-DQ2.5 had better predictive performance than the HLA haplotype approaches. GRS-DQ2.5 models, based on directly genotyped or imputed markers, achieved similar levels of predictive performance (AUC = 0.718-0.73), which were substantially higher than those obtained from the DQ2.5 zygosity alone (AUC = 0.558), the HLA risk haplotype method (AUC = 0.634), or the generic GRS14 (AUC = 0.679). In a screening model of at-risk individuals, the GRS-DQ2.5 lowered the number of unnecessary follow-up tests for CD across most sensitivity levels. Relative to a baseline implicating all DQ2.5-positive individuals for follow-up, the GRS-DQ2.5 resulted in a net saving of 2.2 unnecessary follow-up tests for each justified test while still capturing 90% of DQ2.5-positive CD cases.
Conclusions: Genomic risk scores for CD that target genetically at-risk sub-groups improve predictive performance beyond traditional approaches and may represent a useful strategy for prioritizing individuals at increased risk of disease, thus potentially reducing unnecessary follow-up diagnostic tests.
Affiliation(s)
- Gad Abraham
- Centre for Systems Genomics, School of BioSciences, The University of Melbourne, Parkville, 3010 Victoria Australia; Medical Systems Biology, Department of Pathology and Department of Microbiology & Immunology, The University of Melbourne, Parkville, 3010 Victoria Australia
- Alexia Rohmer
- Medical Systems Biology, Department of Pathology and Department of Microbiology & Immunology, The University of Melbourne, Parkville, 3010 Victoria Australia; Faculty of Life Science, University of Strasbourg, Strasbourg, 67084 CEDEX France
- Jason A Tye-Din
- The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, 3052 Victoria Australia; Department of Medical Biology, The University of Melbourne, Parkville, 3010 Victoria Australia; Department of Gastroenterology, The Royal Melbourne Hospital, Grattan St., Parkville, 3050 Victoria Australia; Murdoch Children's Research Institute, Flemington Road, Parkville, Victoria 3050 Australia
- Michael Inouye
- Centre for Systems Genomics, School of BioSciences, The University of Melbourne, Parkville, 3010 Victoria Australia; Medical Systems Biology, Department of Pathology and Department of Microbiology & Immunology, The University of Melbourne, Parkville, 3010 Victoria Australia
|
88
|
Cardoso JGR, Andersen MR, Herrgård MJ, Sonnenschein N. Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Front Bioeng Biotechnol 2015; 3:13. [PMID: 25763369 PMCID: PMC4329917 DOI: 10.3389/fbioe.2015.00013] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 11/25/2014] [Accepted: 01/22/2015] [Indexed: 11/13/2022]
Abstract
Genetic variation is the motor of evolution and allows organisms to overcome the environmental challenges they encounter. It can be both beneficial and harmful in the process of engineering cell factories for the production of proteins and chemicals. Throughout the history of biotechnology, there have been efforts to exploit genetic variation in our favor to create strains with favorable phenotypes. Genetic variation can either be present in natural populations or it can be artificially created by mutagenesis and selection or adaptive laboratory evolution. On the other hand, unintended genetic variation during a long-term production process may lead to significant economic losses, and it is important to understand how to control this type of variation. With the emergence of next-generation sequencing technologies, genetic variation in microbial strains can now be determined on an unprecedented scale and resolution by re-sequencing thousands of strains systematically. In this article, we review challenges in the integration and analysis of large-scale re-sequencing data, present an extensive overview of bioinformatics methods for predicting the effects of genetic variants on protein function, and discuss approaches for interfacing existing bioinformatics approaches with genome-scale models of cellular processes in order to predict effects of sequence variation on cellular phenotypes.
Affiliation(s)
- João G. R. Cardoso
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Markus J. Herrgård
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Nikolaus Sonnenschein
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
|