1
Hu K, Meyer F, Deng ZL, Asgari E, Kuo TH, Münch PC, McHardy AC. Assessing computational predictions of antimicrobial resistance phenotypes from microbial genomes. Brief Bioinform 2024; 25:bbae206. [PMID: 38706320 PMCID: PMC11070729 DOI: 10.1093/bib/bbae206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 04/08/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled at handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotyping of the beta-lactam compounds aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, alongside tetracyclines, demonstrated more variable performance than the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.
Affiliation(s)
- Kaixin Hu
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Fernando Meyer
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Zhi-Luo Deng
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Ehsaneddin Asgari
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Molecular Cell Biomechanics Laboratory, Department of Bioengineering and Mechanical Engineering, University of California, Berkeley, USA
- Tzu-Hao Kuo
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Philipp C Münch
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
- German Center for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
2
Kolobkov D, Mishra Sharma S, Medvedev A, Lebedev M, Kosaretskiy E, Vakhitov R. Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project. Front Big Data 2024; 7:1266031. [PMID: 38487517 PMCID: PMC10937521 DOI: 10.3389/fdata.2024.1266031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 01/31/2024] [Indexed: 03/17/2024] Open
Abstract
Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
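To make the idea of decentralized training concrete, the following is a minimal sketch of federated averaging (FedAvg), which is a common federated scheme and not necessarily the one used in the study. It assumes a one-parameter least-squares model and three synthetic nodes with different input ranges as a stand-in for inter-node heterogeneity. The names `local_update`, `federated_average` and `make_node`, and all hyperparameter values, are hypothetical and chosen only for illustration.

```python
import random

def local_update(w, data, lr, local_steps):
    """Run a few gradient-descent steps on one node's private data (model: y ~ w * x)."""
    for _ in range(local_steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(nodes, rounds=50, lr=0.05, local_steps=5):
    """FedAvg: only model parameters leave a node, never the raw samples."""
    w = 0.0
    total = sum(len(d) for d in nodes)
    for _ in range(rounds):  # one communication round per iteration
        local_ws = [local_update(w, d, lr, local_steps) for d in nodes]
        # The server averages local models, weighted by node sample size.
        w = sum(len(d) * lw for d, lw in zip(nodes, local_ws)) / total
    return w

def make_node(rng, n, x_low, x_high, true_w=2.0):
    """Synthetic node (assumption): inputs drawn from a node-specific range."""
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0, 0.1)) for x in xs]

rng = random.Random(0)
nodes = [
    make_node(rng, 40, 0.0, 1.0),
    make_node(rng, 20, 1.0, 2.0),
    make_node(rng, 10, -1.0, 0.0),
]
w_fed = federated_average(nodes)

# Centralized reference: closed-form least squares on the pooled data.
pooled = [p for d in nodes for p in d]
w_central = sum(x * y for x, y in pooled) / sum(x * x for x, _ in pooled)
print(w_fed, w_central)
```

Raising `local_steps` while lowering `rounds` reduces how often nodes communicate, which is the kind of trade-off between accuracy and communication cost that the abstract refers to.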
Affiliation(s)
- Dmitry Kolobkov
- GENXT, Hinxton, United Kingdom
- Laboratory of Ecological Genetics, Vavilov Institute of General Genetics, Moscow, Russia
- Satyarth Mishra Sharma
- GENXT, Hinxton, United Kingdom
- Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
- Aleksandr Medvedev
- GENXT, Hinxton, United Kingdom
- Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
3
Brouard C, Mourad R, Vialaneix N. Should we really use graph neural networks for transcriptomic prediction? Brief Bioinform 2024; 25:bbae027. [PMID: 38349060 PMCID: PMC10939369 DOI: 10.1093/bib/bbae027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 12/20/2023] [Accepted: 01/17/2024] [Indexed: 02/15/2024] Open
Abstract
The recent development of deep learning methods has undoubtedly led to great improvement in various machine learning tasks, especially in prediction tasks. These methods have also been adapted to answer various problems in bioinformatics, including automatic genome annotation, artificial genome generation or phenotype prediction. In particular, a specific type of deep learning method, called graph neural network (GNN), has repeatedly been reported as a good candidate to predict phenotypes from gene expression because of its ability to embed information on gene regulation or co-expression through the use of a gene network. However, to date, no complete and reproducible benchmark has been performed to analyze the trade-off between cost and benefit of this approach compared to more standard (and simpler) machine learning methods. In this article, we provide such a benchmark, based on clear and comparable policies to evaluate the different methods on several datasets. Our conclusion is that GNN rarely provides a real improvement in prediction performance, especially when compared to the computation effort required by the methods. Our findings on a limited but controlled simulated dataset show that this could be explained by the limited quality or predictive power of the input biological gene network itself.
Affiliation(s)
- Céline Brouard
- Université Fédérale de Toulouse, INRAE, MIAT, 31326 Castanet-Tolosan, France
- Raphaël Mourad
- Université Fédérale de Toulouse, INRAE, MIAT, 31326 Castanet-Tolosan, France
- Université Paul Sabatier, 31062 Toulouse, France
- Nathalie Vialaneix
- Université Fédérale de Toulouse, INRAE, MIAT, 31326 Castanet-Tolosan, France
4
Bonet D, Levin M, Montserrat DM, Ioannidis AG. Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations. Pac Symp Biocomput 2024; 29:404-418. [PMID: 38160295 PMCID: PMC10799683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
Affiliation(s)
- David Bonet
- Stanford University, Stanford, CA, US
- Universitat Politècnica de Catalunya, Barcelona, Spain
5
Comajoan Cara M, Mas Montserrat D, Ioannidis AG. PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations. Pac Symp Biocomput 2024; 29:327-340. [PMID: 38160290 PMCID: PMC10906137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research. Our code is available at https://github.com/AI-sandbox/PopGenAdapt.
Affiliation(s)
- Marçal Comajoan Cara
- Dept. of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Dept. of Signal Theory & Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
6
Cara MC, Montserrat DM, Ioannidis AG. PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations. bioRxiv 2023:2023.10.10.561715. [PMID: 37873492 PMCID: PMC10592760 DOI: 10.1101/2023.10.10.561715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research.
Affiliation(s)
- Marçal Comajoan Cara
- Dept. of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Dept. of Signal Theory & Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
- Daniel Mas Montserrat
- Dept. of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Alexander G Ioannidis
- Dept. of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Dept. of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA, USA
7
Ruan M, Hu Z, Zhu Q, Li Y, Nie X. 16S rDNA Sequencing-Based Insights into the Bacterial Community Structure and Function in Co-Existing Soil and Coal Gangue. Microorganisms 2023; 11:2151. [PMID: 37763995 PMCID: PMC10536285 DOI: 10.3390/microorganisms11092151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 08/16/2023] [Accepted: 08/21/2023] [Indexed: 09/29/2023] Open
Abstract
Coal gangue is a solid waste emitted during coal production. Coal gangue is deployed adjacent to mining land and has characteristics similar to those of the soils of these areas. Coal gangue-soil ecosystems provide habitats for a rich and active bacterial community. However, co-existence networks and the functionality of soil and coal gangue bacterial communities have not been studied. Here, we performed Illumina MiSeq high-throughput sequencing, symbiotic network and statistical analyses, and microbial phenotype prediction to study the microbial community in coal gangue and soil samples from Shanxi Province, China. In general, the structural difference between the bacterial communities in coal gangue and soil was large, indicating that interactions between soil and coal gangue are limited but not absent. The bacterial community exhibited a significant symbiosis network in soil and coal gangue. The co-occurrence network was primarily formed by Proteobacteria, Firmicutes, and Actinobacteria. In addition, BugBase microbiome phenotype predictions and PICRUSt bacterial functional potential predictions showed that transcription regulators represented the highest functional category of symbiotic bacteria in soil and coal gangue. Proteobacteria played an important role in various processes such as mobile element pathogenicity, oxidative stress tolerance, and biofilm formation. In general, this work provides a theoretical basis and data support for the in situ remediation of acidified coal gangue hills based on microbiological methods.
Affiliation(s)
- Mengying Ruan
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology-Beijing, Beijing 100083, China
- Zhenqi Hu
- China University of Mining and Technology, Xuzhou 221116, China
- Qi Zhu
- National Engineering Laboratory for Lake Pollution Control and Ecological Restoration, State Environmental Protection Key Laboratory for Lake Pollution Control, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Yuanyuan Li
- China University of Mining and Technology, Xuzhou 221116, China
- Xinran Nie
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology-Beijing, Beijing 100083, China
8
Aspromonte MC, Conte AD, Zhu S, Tan W, Shen Y, Zhang Y, Li Q, Wang MH, Babbi G, Bovo S, Martelli PL, Casadio R, Althagafi A, Toonsi S, Kulmanov M, Hoehndorf R, Katsonis P, Williams A, Lichtarge O, Xian S, Surento W, Pejaver V, Mooney SD, Sunderam U, Srinivasan R, Murgia A, Piovesan D, Tosatto SCE, Leonardi E. CAGI6 ID-Challenge: Assessment of phenotype and variant predictions in 415 children with Neurodevelopmental Disorders (NDDs). Res Sq 2023:rs.3.rs-3209168. [PMID: 37577579 PMCID: PMC10418555 DOI: 10.21203/rs.3.rs-3209168/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
In the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID-challenge to provide an opportunity to develop computational methods for predicting patients' phenotypes and causal variants. Eight research teams, with a total of 30 models, had access to the phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions, with onset in infancy. In this study we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes, based on the clinical features described for each patient, and causal variants. Finally, we asked participants to develop a method to find new possible genetic causes for patients without a genetic diagnosis. As in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (causative, putative pathogenic and contributing factors) were provided. Considering the overall clinical manifestation of our cohort, we provided the variant data and phenotypic traits of the 150 patients from the CAGI5 ID-Challenge as training and validation data for developing the prediction methods.
Affiliation(s)
- Shaowen Zhu
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
- Wuwei Tan
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
- Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
- Qi Li
- CUHK Shenzhen Research Institute, Shenzhen
- Giulia Babbi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
- Samuele Bovo
- Department of Agricultural and Food Sciences, University of Bologna
- Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
- Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
- Azza Althagafi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
- Sumyyah Toonsi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
- Maxat Kulmanov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
- Robert Hoehndorf
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
- Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
- Amanda Williams
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
- Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
- Su Xian
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
- Wesley Surento
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
- Vikas Pejaver
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029
- Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
- Uma Sunderam
- Innovation Labs, Tata Consultancy Services, Hyderabad
9
Mowlaei ME, Shi X. FSF-GA: A Feature Selection Framework for Phenotype Prediction Using Genetic Algorithms. Genes (Basel) 2023; 14:genes14051059. [PMID: 37239419 DOI: 10.3390/genes14051059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 05/04/2023] [Accepted: 05/06/2023] [Indexed: 05/28/2023] Open
Abstract
(1) Background: Phenotype prediction is a pivotal task in genetics for identifying how genetic factors contribute to phenotypic differences. This field has seen extensive research, with numerous methods proposed for predicting phenotypes. Nevertheless, the intricate relationship between genotypes and complex phenotypes, including common diseases, makes it an ongoing challenge to accurately decipher the genetic contribution. (2) Results: In this study, we propose a novel feature selection framework for phenotype prediction utilizing a genetic algorithm (FSF-GA) that effectively reduces the feature space to identify genotypes contributing to phenotype prediction. We provide a comprehensive vignette of our method and conduct extensive experiments using a widely used yeast dataset. (3) Conclusions: Our experimental results show that our proposed FSF-GA method delivers phenotype prediction performance comparable to that of baseline methods, while providing features selected for predicting phenotypes. These selected feature sets can be used to interpret the underlying genetic architecture that contributes to phenotypic variation.
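As a rough illustration of feature selection with a genetic algorithm, the sketch below evolves binary feature masks on synthetic data. It is not the FSF-GA implementation: the fitness function (absolute Pearson correlation with the phenotype minus a fixed per-feature penalty) is a simple stand-in assumed here, and the names `pearson`, `make_fitness` and `select_features`, the synthetic data, and all parameter values are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def make_fitness(columns, phenotype, penalty=0.3):
    """Stand-in fitness (assumption): reward relevance, charge a cost per feature."""
    relevance = [abs(pearson(col, phenotype)) for col in columns]
    def fitness(mask):
        return sum(r - penalty for r, bit in zip(relevance, mask) if bit)
    return fitness

def select_features(fitness, n_features, rng, pop_size=30, generations=60):
    """Evolve binary masks; a 1 at position i means feature i is selected."""
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = [pop[0][:], pop[1][:]]  # elitism: keep the two best masks
        while len(next_pop) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            b = max(rng.sample(pop, 3), key=fitness)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Bit-flip mutation with probability 1 / n_features per position.
            child = [bit ^ (rng.random() < 1 / n_features) for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Synthetic data (assumption): only features 0 and 3 influence the phenotype.
rng = random.Random(1)
n_samples, n_features = 400, 8
columns = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
phenotype = [columns[0][i] + columns[3][i] + rng.gauss(0, 0.5) for i in range(n_samples)]

fitness = make_fitness(columns, phenotype)
best = select_features(fitness, n_features, rng)
selected = [i for i, bit in enumerate(best) if bit]
print(selected)
```

On this synthetic data the search should settle on the two informative features. In a realistic setting the fitness would more plausibly be the validation performance of a predictive model trained on the selected features, which is far more expensive to evaluate.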
Affiliation(s)
- Mohammad Erfan Mowlaei
- Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
- Xinghua Shi
- Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
10
Chen Y, Guo Y, Guan P, Wang Y, Wang X, Wang Z, Qin Z, Ma S, Xin M, Hu Z, Yao Y, Ni Z, Sun Q, Guo W, Peng H. A wheat integrative regulatory network from large-scale complementary functional datasets enables trait-associated gene discovery for crop improvement. Mol Plant 2023; 16:393-414. [PMID: 36575796 DOI: 10.1016/j.molp.2022.12.019] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/18/2022] [Revised: 11/28/2022] [Accepted: 12/18/2022] [Indexed: 06/17/2023]
Abstract
Gene regulation is central to all aspects of organism growth, and understanding it using large-scale functional datasets can provide a whole view of biological processes controlling complex phenotypic traits in crops. However, the connection between massive functional datasets and trait-associated gene discovery for crop improvement is still lacking. In this study, we constructed a wheat integrative gene regulatory network (wGRN) by combining an updated genome annotation and diverse complementary functional datasets, including gene expression, sequence motif, transcription factor (TF) binding, chromatin accessibility, and evolutionarily conserved regulation. wGRN contains 7.2 million genome-wide interactions covering 5947 TFs and 127 439 target genes, which were further verified using known regulatory relationships, condition-specific expression, gene functional information, and experiments. We used wGRN to assign genome-wide genes to 3891 specific biological pathways and accurately prioritize candidate genes associated with complex phenotypic traits in genome-wide association studies. In addition, wGRN was used to enhance the interpretation of a spike temporal transcriptome dataset to construct high-resolution networks. We further unveiled novel regulators that enhance the power of spike phenotypic trait prediction using machine learning and contribute to the spike phenotypic differences among modern wheat accessions. Finally, we developed an interactive webserver, wGRN (http://wheat.cau.edu.cn/wGRN), for the community to explore gene regulation and discover trait-associated genes. Collectively, this community resource establishes the foundation for using large-scale functional datasets to guide trait-associated gene discovery for crop improvement.
Affiliation(s)
- Yongming Chen
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Yiwen Guo
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Panfeng Guan
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Yongfa Wang
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Xiaobo Wang
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Zihao Wang
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Zhen Qin
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Shengwei Ma
- Hainan Yazhou Bay Seed Laboratory, Sanya, Hainan, China; State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China
- Mingming Xin
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Zhaorong Hu
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Yingyin Yao
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Zhongfu Ni
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Qixin Sun
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Weilong Guo
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
- Huiru Peng
- Frontiers Science Center for Molecular Design Breeding, Key Laboratory of Crop Heterosis and Utilization, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China
11
Forutan M, Lynn A, Aliloo H, Clark SA, McGilchrist P, Polkinghorne R, Hayes BJ. Predicting phenotypes of beef eating quality traits. Front Genet 2023; 14:1089490. [PMID: 36816029 PMCID: PMC9936823 DOI: 10.3389/fgene.2023.1089490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Accepted: 01/19/2023] [Indexed: 02/04/2023] Open
Abstract
Introduction: Phenotype predictions of beef eating quality for individual animals could be used to allocate animals to longer and more expensive feeding regimes as they enter the feedlot if they are predicted to have higher eating quality, and to sort carcasses into consumer or market value categories. Phenotype predictions can include genetic effects (breed effects, heterosis and breeding value), predicted from genetic markers, as well as fixed effects such as days aged and carcass weight, hump height, ossification, and hormone growth promotant (HGP) status. Methods: Here we assessed the accuracy of phenotype predictions for five eating quality traits (tenderness, juiciness, flavour, overall liking and MQ4) in striploins from 1701 animals from a wide variety of backgrounds, including Bos indicus and Bos taurus breeds, using genotypes and simple fixed effects including days aged and carcass weight. The genetic components were predicted from 709k single nucleotide polymorphisms (SNPs) using the BayesR model, which assumes some markers may have a moderate to large effect. Fixed effects in the prediction included principal components of the genomic relationship matrix, to account for breed effects, heterosis, days aged and carcass weight. Results and Discussion: A model which allowed breed effects to be captured in the SNP effects (i.e., not explicitly fitting these effects) tended to have slightly higher accuracies (0.43-0.50) compared to when these effects were explicitly fitted as fixed effects (0.42-0.49), perhaps because breed effects when explicitly fitted were estimated with more error than when incorporated into the (random) SNP effects. Adding estimates of effects of days aged and carcass weight did not increase the accuracy of phenotype predictions in this particular analysis.
The accuracy of phenotype prediction for beef eating quality traits was sufficiently high that such predictions could be useful in predicting eating quality from DNA samples taken from an animal/carcass as it enters the processing plant, to enable optimal supply chain value extraction by sorting product into markets with different quality. The BayesR predictions identified several novel genes potentially associated with beef eating quality.
Affiliation(s)
- Mehrnush Forutan
- Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
- Andrew Lynn
- School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia
- Hassan Aliloo
- School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia
- Samuel A. Clark
- School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia
- Peter McGilchrist
- School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia
- Ben J. Hayes
- Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
12
John M, Haselbeck F, Dass R, Malisi C, Ricca P, Dreischer C, Schultheiss SJ, Grimm DG. A comparison of classical and machine learning-based phenotype prediction methods on simulated data and three plant species. Front Plant Sci 2022; 13:932512. [PMID: 36407627 PMCID: PMC9673477 DOI: 10.3389/fpls.2022.932512] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/29/2022] [Accepted: 07/25/2022] [Indexed: 06/16/2023]
Abstract
Genomic selection is an integral tool for breeders to accurately select plants directly from genotype data, leading to faster and more resource-efficient breeding programs. Several prediction methods have been established in the last few years. These range from classical linear mixed models to complex non-linear machine learning approaches, such as Support Vector Regression, and modern deep learning-based architectures. Many of these methods have been extensively evaluated on different crop species with varying outcomes. In this work, our aim is to systematically compare 12 different phenotype prediction models, ranging from basic genomic selection methods to more advanced deep learning-based techniques. More importantly, we assess the performance of these models on simulated phenotype data as well as on real-world data from Arabidopsis thaliana and two breeding datasets from soy and corn. The synthetic phenotypic data allow us to analyze all prediction models and especially the selected markers under controlled and predefined settings. We show that Bayes B and linear regression models with sparsity constraints perform best under different simulation settings with respect to explained variance. Further, we can confirm results from other studies that there is no superiority of more complex neural network-based architectures for phenotype prediction compared to well-established methods. However, on real-world data, for which several prediction models yield comparable results with slight advantages for Elastic Net, this picture is less clear, suggesting that there is a lot of room for future research.
Affiliation(s)
- Maura John
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, Straubing, Germany
- Florian Haselbeck
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, Straubing, Germany
- Dominik G. Grimm
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, Straubing, Germany
- Technical University of Munich, Department of Informatics, Garching, Germany
13
Jiang Z, Lu Y, Liu Z, Wu W, Xu X, Dinnyés A, Yu Z, Chen L, Sun Q. Drug resistance prediction and resistance genes identification in Mycobacterium tuberculosis based on a hierarchical attentive neural network utilizing genome-wide variants. Brief Bioinform 2022; 23:6553603. [PMID: 35325021 DOI: 10.1093/bib/bbac041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/01/2021] [Revised: 01/18/2022] [Accepted: 01/27/2022] [Indexed: 01/25/2023] Open
Abstract
Prediction of antimicrobial resistance based on whole-genome sequencing data has attracted greater attention due to its rapidity and convenience. Numerous machine learning-based studies have used genetic variants to predict drug resistance in Mycobacterium tuberculosis (MTB), assuming that variants are homogeneous, and most of these studies, however, have ignored the essential correlation between variants and corresponding genes when encoding variants, and used a limited number of variants as prediction input. In this study, taking advantage of genome-wide variants for drug-resistance prediction and inspired by natural language processing, we summarize drug resistance prediction into document classification, in which variants are considered as words, mutated genes in an isolate as sentences, and an isolate as a document. We propose a novel hierarchical attentive neural network model (HANN) that helps discover drug resistance-related genes and variants and acquire more interpretable biological results. It captures the interaction among variants in a mutated gene as well as among mutated genes in an isolate. Our results show that for the four first-line drugs of isoniazid (INH), rifampicin (RIF), ethambutol (EMB) and pyrazinamide (PZA), the HANN achieves the optimal area under the ROC curve of 97.90, 99.05, 96.44 and 95.14% and the optimal sensitivity of 94.63, 96.31, 92.56 and 87.05%, respectively. In addition, without any domain knowledge, the model identifies drug resistance-related genes and variants consistent with those confirmed by previous studies, and more importantly, it discovers one more potential drug-resistance-related gene.
Affiliation(s)
- Zhonghua Jiang
- Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Yongmei Lu
- College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Zhuochong Liu
- Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Wei Wu
- Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Xinyi Xu
- Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- András Dinnyés
- BioTalentum Ltd., Aulich Lajos str. 26, 2100 Gödöllő, Hungary
- Zhonghua Yu
- College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Li Chen
- College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Qun Sun
- Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
14
Dinh JC, Boone EC, Staggs VS, Pearce RE, Wang WY, Gaedigk R, Leeder JS, Gaedigk A. The Impact of the CYP2D6 "Enhancer" Single Nucleotide Polymorphism on CYP2D6 Activity. Clin Pharmacol Ther 2022; 111:646-654. [PMID: 34716917 PMCID: PMC8825689 DOI: 10.1002/cpt.2469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2021] [Accepted: 10/21/2021] [Indexed: 11/10/2022]
Abstract
rs5758550 has been associated with enhanced transcription and suggested to be a useful marker of CYP2D6 activity. As there are limited and inconsistent data regarding the utility of this distant "enhancer" single nucleotide polymorphism (SNP), our goal was to further assess the impact of rs5758550 on CYP2D6 activity toward two probe substrates, atomoxetine (ATX) and dextromethorphan (DM), using in vivo urinary metabolite (DM; n = 188) and pharmacokinetic (ATX; n = 70) and in vitro metabolite formation (ATX and DM; n = 166) data. All subjects and tissues were extensively genotyped, the "enhancer" SNP phased with established CYP2D6 haplotypes either computationally or experimentally, and the impact on CYP2D6 activity investigated using several linear models of varying complexity to determine the proportion of variability in CYP2D6 activity captured by each model. For all datasets and models, the "enhancer" SNP had no or only a modest impact on CYP2D6 activity prediction. An increased effect, when present, was more pronounced for ATX than DM, suggesting potential substrate-dependency. In addition, CYP2D6*2 alleles with the "enhancer" SNP were associated with modestly higher metabolite formation rates in vitro, but not in vivo; no effect was detected for CYP2D6*1 alleles with the "enhancer" SNP. In summary, it remains inconclusive whether the small effects detected in this investigation are indeed caused by the "enhancer" SNP or are rather due to its incomplete linkage with other variants within the gene. Taken together, there does not appear to be sufficient evidence to warrant the "enhancer" SNP be included in clinical CYP2D6 pharmacogenetic testing.
Affiliation(s)
- Jean C Dinh
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Erin C Boone
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Vincent S Staggs
- Biostatistics and Epidemiology Core, Health Services and Outcomes Research, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Department of Pediatrics, University of Missouri-Kansas City School of Medicine, Kansas City, Missouri, USA
- Robin E Pearce
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Wendy Y Wang
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Roger Gaedigk
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- James Steven Leeder
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Department of Pediatrics, University of Missouri-Kansas City School of Medicine, Kansas City, Missouri, USA
- Andrea Gaedigk
- Division of Clinical Pharmacology, Toxicology, and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri, USA
- Department of Pediatrics, University of Missouri-Kansas City School of Medicine, Kansas City, Missouri, USA
15
Diepenbroek M, Bayer B, Anslinger K. Pushing the Boundaries: Forensic DNA Phenotyping Challenged by Single-Cell Sequencing. Genes (Basel) 2021; 12:1362. [PMID: 34573344 PMCID: PMC8466929 DOI: 10.3390/genes12091362] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/28/2021] [Revised: 08/24/2021] [Accepted: 08/27/2021] [Indexed: 12/26/2022] Open
Abstract
Single-cell sequencing is a fast developing and very promising field; however, it is not commonly used in forensics. The main motivation behind introducing this technology into forensics is to improve mixture deconvolution, especially when a trace consists of the same cell type. Successful studies demonstrate the ability to analyze a mixture by separating single cells and obtaining CE-based STR profiles. This indicates a potential use of the method in other forensic investigations, like forensic DNA phenotyping, in which using mixed traces is not fully recommended. For this study, we collected single-source autopsy blood from which the white cells were first stained and later separated with the DEPArray™ N×T System. Groups of 20, 10, and 5 cells, as well as 20 single cells, were collected and submitted for DNA extraction. Libraries were prepared using the Ion AmpliSeq™ PhenoTrivium Panel, which includes both phenotype (HIrisPlex-S: eye, hair, and skin color) and ancestry-associated SNP markers. Prior to sequencing, half of the single-cell-based libraries were additionally amplified and purified in order to improve the library concentrations. Ancestry and phenotype analysis resulted in nearly full consensus profiles, resulting in correct predictions not only for the cell groups but also for the ten re-amplified single-cell libraries. Our results suggest that sequencing of single cells can be a promising tool used to deconvolute mixed traces submitted for forensic DNA phenotyping.
16
Raghu VK, Ge X, Balajiee A, Shirer DJ, Das I, Benos PV, Chrysanthis PK. A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:811-822. [PMID: 32841121 PMCID: PMC8237279 DOI: 10.1109/tcbb.2020.3019237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM's graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.
17
Song K, Wright FA, Zhou YH. Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction. Front Mol Biosci 2020; 7:610845. [PMID: 33392266 PMCID: PMC7772236 DOI: 10.3389/fmolb.2020.610845] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/27/2020] [Accepted: 11/25/2020] [Indexed: 12/12/2022] Open
Abstract
Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Taxonomic Unit (OTU) count matrices. However, alternate approaches such as Amplicon Sequence Variants (ASV) have been used, as well as the direct use of k-mer sequence counts. The overall effect of these different types of predictors when used in concert with various machine learning methods has been difficult to assess, due to varied combinations described in the literature. Here we provide an in-depth investigation of more than 1,000 combinations of these three clustering/counting methods, in combination with varied choices for normalization and filtering, grouping at various taxonomic levels, and the use of more than ten commonly used machine learning methods for phenotype prediction. The use of short k-mers, which have computational advantages and conceptual simplicity, is shown to be effective as a source for microbiome-based prediction. Among machine-learning approaches, tree-based methods show consistent, though modest, advantages in prediction accuracy. We describe the various advantages and disadvantages of combinations in analysis approaches, and provide general observations to serve as a useful guide for future trait-prediction explorations using microbiome data.
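The k-mer predictors mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `kmer_counts` and `kmer_profile`, the toy reads, the choice to skip windows containing ambiguous bases, and the relative-frequency normalization are all assumptions made for illustration.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def kmer_counts(seq, k):
    """Count overlapping k-mers in one read, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts

def kmer_profile(reads, k):
    """Pool k-mer counts over all reads of a sample and convert to relative frequencies."""
    total = Counter()
    for read in reads:
        total.update(kmer_counts(read, k))
    n = sum(total.values())
    return {kmer: c / n for kmer, c in total.items()} if n else {}

# Hypothetical toy sample: two short reads, the second with one ambiguous base.
profile = kmer_profile(["ACGTAC", "CGTNAC"], k=3)
```

Each sample's profile can then serve as a feature vector for a downstream classifier; with short k the feature space stays small (at most 4^k entries), which is one reading of the computational advantage the abstract refers to.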
Affiliation(s)
- Kuncheng Song
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Fred A Wright
- Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
18
Diepenbroek M, Bayer B, Schwender K, Schiller R, Lim J, Lagacé R, Anslinger K. Evaluation of the Ion AmpliSeq™ PhenoTrivium Panel: MPS-Based Assay for Ancestry and Phenotype Predictions Challenged by Casework Samples. Genes (Basel) 2020; 11:E1398. [PMID: 33255693 DOI: 10.3390/genes11121398] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/02/2020] [Revised: 11/19/2020] [Accepted: 11/22/2020] [Indexed: 12/21/2022] Open
Abstract
As the field of forensic DNA analysis has started to transition from genetics to genomics, new methods to aid in crime scene investigations have arisen. The development of informative single nucleotide polymorphism (SNP) markers has led the forensic community to question if DNA can be a reliable "eye-witness" and whether the data it provides can shed light on unknown perpetrators. We have developed an assay called the Ion AmpliSeq™ PhenoTrivium Panel, which combines three groups of markers: 41 phenotype- and 163 ancestry-informative autosomal SNPs together with 120 lineage-specific Y-SNPs. Here, we report the results of testing the assay's sensitivity and the predictions obtained for known reference samples. Moreover, we present the outcome of a blind study performed on real casework samples in order to understand the value and reliability of the information that would be provided to police investigators. Furthermore, we evaluated the accuracy of admixture prediction in Converge™ Software. The results show the panel to be a robust and sensitive assay which can be used to analyze casework samples. We conclude that the combination of the obtained predictions of phenotype, biogeographical ancestry, and male lineage can serve as a potential lead in challenging police investigations such as cold cases or cases with no suspect.
19
Abstract
The prediction of breeding values and phenotypes is of central importance for both livestock and crop breeding. In this study, we analyze the use of artificial neural networks (ANN) and, in particular, local convolutional neural networks (LCNN) for genomic prediction, as a region-specific filter corresponds much better with our prior knowledge of the genetic architecture of traits than traditional convolutional neural networks. Model performances are evaluated on a simulated maize data panel (n = 10,000; p = 34,595) and real Arabidopsis data (n = 2,039; p = 180,000) for a variety of traits based on their predictive ability. The baseline LCNN, containing one local convolutional layer (kernel size: 10) and two fully connected layers with 64 nodes each, outperforms commonly proposed ANNs (multilayer perceptrons and convolutional neural networks) for basically all considered traits. For traits with high heritability and a large training population, as present in the simulated data, LCNNs even outperform state-of-the-art methods like genomic best linear unbiased prediction (GBLUP), Bayesian models and extended GBLUP, indicated by an increase in predictive ability of up to 24%. However, for small training populations, these state-of-the-art methods outperform all considered ANNs. Nevertheless, the LCNN still outperforms all other considered ANNs by around 10%. Minor improvements to the tested baseline network architecture of the LCNN were obtained by increasing the kernel size and reducing the stride, whereas the number of subsequent fully connected layers and their node sizes had negligible impact. Although gains in predictive ability were obtained for large-scale data sets by using LCNNs, the practical use of ANNs comes with additional problems, such as the need to genotype all considered individuals and the lack of estimation of heritability and reliability. Furthermore, breeding values are additive by design, whereas ANN-based estimates are not.
However, ANNs also come with new opportunities, as networks can easily be extended to account for additional inputs (omics, weather etc.) and outputs (multi-trait models), and computing time increases linearly with the number of individuals. With advances in high-throughput phenotyping and cheaper genotyping, ANNs can become a valid alternative for genomic prediction.
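The difference between a local convolutional layer and an ordinary convolution, as described in this abstract, is that each genomic window gets its own filter rather than sharing one filter across all positions. The following is a minimal sketch of that idea only: the function name `local_conv1d`, the ReLU activation, the single output channel and the toy numbers are assumptions, not details taken from the abstract.

```python
def local_conv1d(x, weights, biases, kernel_size, stride):
    """Locally connected 1D layer: window w uses its own weights[w] and biases[w].

    x        marker vector for one individual (e.g. allele counts 0/1/2)
    weights  one list of kernel_size weights per window (not shared across windows)
    biases   one bias per window
    """
    n_windows = (len(x) - kernel_size) // stride + 1
    if len(weights) != n_windows or len(biases) != n_windows:
        raise ValueError("need exactly one filter and one bias per window")
    out = []
    for w in range(n_windows):
        start = w * stride
        window = x[start:start + kernel_size]
        z = sum(a * b for a, b in zip(window, weights[w])) + biases[w]
        out.append(max(0.0, z))  # ReLU activation (an assumption of this sketch)
    return out

# Hypothetical toy input: six markers, kernel size 2, stride 2, so three windows.
markers = [0, 1, 2, 1, 0, 2]
filters = [[1.0, 1.0], [0.5, -1.0], [-1.0, 0.5]]
out = local_conv1d(markers, filters, [0.0, 0.0, 0.0], kernel_size=2, stride=2)
```

Because the filters are not shared, the parameter count grows with the number of windows; a smaller stride or larger kernel, which the abstract reports as giving minor improvements, increases either the number of windows or the weights per window.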
Affiliation(s)
- Torsten Pook
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Goettingen, Göttingen, Germany
- Jan Freudenthal
- Center for Computational and Theoretical Biology, University of Wuerzburg, Wuerzburg, Germany
- Arthur Korte
- Center for Computational and Theoretical Biology, University of Wuerzburg, Wuerzburg, Germany
- Henner Simianer
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Goettingen, Göttingen, Germany
20
Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, Corander J. Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions. mBio 2020; 11:e01344-20. [PMID: 32636251 PMCID: PMC7343994 DOI: 10.1128/mbio.01344-20] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 05/22/2020] [Accepted: 06/05/2020] [Indexed: 12/19/2022] Open
Abstract
Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially. IMPORTANCE: Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis.
These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
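Elastic net penalization, named in this abstract as the model-fitting step, can be sketched with plain coordinate descent. This is an illustrative sketch and not the authors' software: it fits a linear model without an intercept, uses the objective (1/2n)·||y − Xβ||² + α·ρ·||β||₁ + ½·α·(1 − ρ)·||β||² (with ρ written as `l1_ratio`), and the presence/absence matrix `X` and phenotype `y` are hypothetical toy data.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def elastic_net(X, y, alpha, l1_ratio, n_sweeps=100):
    """Coordinate descent for elastic-net-penalized least squares (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = [float(v) for v in y]  # residual y - X @ beta, with beta = 0 initially
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0:
                continue  # variant absent from every sample
            # Correlation of column j with the residual that excludes its own effect.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            resid = [r - v * (new - beta[j]) for v, r in zip(col, resid)]
            beta[j] = new
    return beta

# Hypothetical toy data: rows are isolates, columns are unitig presence/absence.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y = [1, 1, 0, 0, 1, 0]  # phenotype tracks the first unitig exactly
beta = elastic_net(X, y, alpha=0.05, l1_ratio=0.5)
```

In this toy case only the first coefficient stays non-zero, which shows the interpretability argument made in the abstract: the L1 part of the penalty sets uninformative variants to exactly zero, so the fitted model doubles as a list of selected loci.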
Affiliation(s)
- John A Lees
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
- T Tien Mai
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
- Marco Galardini
- Biological Design Center, Boston University, Boston, Massachusetts, USA
- Nicole E Wheeler
- Centre for Genomic Pathogen Surveillance, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Samuel T Horsfield
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
- Julian Parkhill
- Department of Veterinary Medicine, University of Cambridge, Cambridge, United Kingdom
- Jukka Corander
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
- Centre for Genomic Pathogen Surveillance, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Helsinki Institute of Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
21
Chun S, Imakaev M, Hui D, Patsopoulos NA, Neale BM, Kathiresan S, Stitziel NO, Sunyaev SR. Non-parametric Polygenic Risk Prediction via Partitioned GWAS Summary Statistics. Am J Hum Genet 2020; 107:46-59. [PMID: 32470373 PMCID: PMC7332650 DOI: 10.1016/j.ajhg.2020.05.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 06/15/2019] [Accepted: 05/01/2020] [Indexed: 02/07/2023] Open
Abstract
In complex trait genetics, the ability to predict phenotype from genotype is the ultimate measure of our understanding of genetic architecture underlying the heritability of a trait. A complete understanding of the genetic basis of a trait should allow for predictive methods with accuracies approaching the trait's heritability. The highly polygenic nature of quantitative traits and most common phenotypes has motivated the development of statistical strategies focused on combining myriad individually non-significant genetic effects. Now that predictive accuracies are improving, there is a growing interest in the practical utility of such methods for predicting risk of common diseases responsive to early therapeutic intervention. However, existing methods require individual-level genotypes or depend on accurately specifying the genetic architecture underlying each disease to be predicted. Here, we propose a polygenic risk prediction method that does not require explicitly modeling any underlying genetic architecture. We start with summary statistics in the form of SNP effect sizes from a large GWAS cohort. We then remove the correlation structure across summary statistics arising due to linkage disequilibrium and apply a piecewise linear interpolation on conditional mean effects. In both simulated and real datasets, this new non-parametric shrinkage (NPS) method can reliably allow for linkage disequilibrium in summary statistics of 5 million dense genome-wide markers and consistently improves prediction accuracy. We show that NPS improves the identification of groups at high risk for breast cancer, type 2 diabetes, inflammatory bowel disease, and coronary heart disease, all of which have available early intervention or prevention treatments.
Affiliation(s)
- Sung Chun
- Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
- Maxim Imakaev
- Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
- Daniel Hui
- Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Systems Biology and Computer Science Program, Ann Romney Center for Neurological Diseases, Department of Neurology, Brigham & Women's Hospital, Boston, MA 02115, USA
- Nikolaos A Patsopoulos
- Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Systems Biology and Computer Science Program, Ann Romney Center for Neurological Diseases, Department of Neurology, Brigham & Women's Hospital, Boston, MA 02115, USA
- Benjamin M Neale
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
- Sekar Kathiresan
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA; Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA 02114, USA
- Nathan O Stitziel
- Cardiovascular Division, Department of Medicine, Washington University School of Medicine, Saint Louis, MO 63110, USA; Department of Genetics, Washington University School of Medicine, Saint Louis, MO 63110, USA; McDonnell Genome Institute, Washington University School of Medicine, Saint Louis, MO 63110, USA
- Shamil R Sunyaev
- Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
|
22
|
Livesey BJ, Marsh JA. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol Syst Biol 2020; 16:e9380. [PMID: 32627955 PMCID: PMC7336272 DOI: 10.15252/msb.20199380] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Received: 11/27/2019] [Revised: 05/18/2020] [Accepted: 05/26/2020] [Indexed: 12/23/2022] Open
Abstract
To deal with the huge number of novel protein-coding variants identified by genome and exome sequencing studies, many computational variant effect predictors (VEPs) have been developed. Such predictors are often trained and evaluated using different variant data sets, making a direct comparison between VEPs difficult. In this study, we use 31 previously published deep mutational scanning (DMS) experiments, which provide quantitative, independent phenotypic measurements for large numbers of single amino acid substitutions, in order to benchmark and compare 46 different VEPs. We also evaluate the ability of DMS measurements and VEPs to discriminate between pathogenic and benign missense variants. We find that DMS experiments tend to be superior to the top-ranking predictors, demonstrating the tremendous potential of DMS for identifying novel human disease mutations. Among the VEPs, DeepSequence clearly stood out, showing both the strongest correlations with DMS data and having the best ability to predict pathogenic mutations, which is especially remarkable given that it is an unsupervised method. We further recommend SNAP2, DEOGEN2, SNPs&GO, SuSPect and REVEL based upon their performance in these analyses.
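As a rough illustration of how a predictor can be scored against DMS measurements, the sketch below computes a Spearman rank correlation in plain Python. It is an assumption that a rank correlation of this kind matches the benchmark; the abstract only states that correlations with DMS data were compared. The function names and the toy scores are hypothetical.

```python
def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their shared mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank_with_ties(x), rank_with_ties(y))


# Hypothetical toy data: one predictor score and one DMS fitness value
# per amino acid substitution.
vep_scores = [0.91, 0.15, 0.60, 0.33]
dms_fitness = [0.05, 0.80, 0.40, 0.70]
rho = spearman(vep_scores, dms_fitness)
```

Because predictors differ in whether high scores mean damaging or tolerated, comparing the absolute value of `rho` across predictors is one reasonable design choice when ranking them.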
Affiliation(s)
- Benjamin J Livesey
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
| | - Joseph A Marsh
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
|
23
|
Palencia-Madrid L, Xavier C, de la Puente M, Hohoff C, Phillips C, Kayser M, Parson W. Evaluation of the VISAGE Basic Tool for Appearance and Ancestry Prediction Using PowerSeq Chemistry on the MiSeq FGx System. Genes (Basel) 2020; 11:E708. [PMID: 32604780 DOI: 10.3390/genes11060708] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 05/11/2020] [Revised: 06/09/2020] [Accepted: 06/11/2020] [Indexed: 01/23/2023] Open
Abstract
The study of DNA to predict externally visible characteristics (EVCs) and the biogeographical ancestry (BGA) from unknown samples is gaining relevance in forensic genetics. Technical developments in Massively Parallel Sequencing (MPS) enable the simultaneous analysis of hundreds of DNA markers, which improves successful Forensic DNA Phenotyping (FDP). The EU-funded VISAGE (VISible Attributes through GEnomics) Consortium has developed various targeted MPS-based lab tools to apply FDP in routine forensic analyses. Here, we present an evaluation of the VISAGE Basic tool for appearance and ancestry prediction based on PowerSeq chemistry (Promega) on a MiSeq FGx System (Illumina). The panel consists of 153 single nucleotide polymorphisms (SNPs) that provide information about EVCs (41 SNPs for eye, hair and skin color from HIrisPlex-S) and continental BGA (115 SNPs; three overlap with the EVCs SNP set). The assay was evaluated for sensitivity, repeatability and genotyping concordance, as well as its performance with casework-type samples. This targeted MPS assay provided complete genotypes at all 153 SNPs down to 125 pg of input DNA and 99.67% correct genotypes at 50 pg. It was robust in terms of repeatability and concordance and provided useful results with casework-type samples. The results suggest that this MPS assay is a useful tool for basic appearance and ancestry prediction in forensic genetics for users interested in applying PowerSeq chemistry and MiSeq for this purpose.
|
24
|
Liu YH, Xu Y, Zhang M, Cui Y, Sze SH, Smith CW, Xu S, Zhang HB. Accurate Prediction of a Quantitative Trait Using the Genes Controlling the Trait for Gene-Based Breeding in Cotton. Front Plant Sci 2020; 11:583277. [PMID: 33281846 PMCID: PMC7690289 DOI: 10.3389/fpls.2020.583277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Accepted: 10/15/2020] [Indexed: 05/03/2023]
Abstract
Accurate phenotype prediction of quantitative traits is paramount to enhanced plant research and breeding. Here, we report the accurate prediction of cotton fiber length, a typical quantitative trait, using 474 cotton (Gossypium ssp.) fiber length (GFL) genes and nine prediction models. When the SNPs/InDels contained in 226 of the GFL genes or the expressions of all 474 GFL genes were used for fiber length prediction, a prediction accuracy of r = 0.83 was obtained, approaching the maximally possible prediction accuracy of a quantitative trait. This is a 116% improvement over the prediction accuracies of fiber length achieved so far by genomic selection using genome-wide random DNA markers. Moreover, analysis of the GFL genes identified 125 GFL genes that are key to accurate prediction of fiber length, with which a prediction accuracy similar to that of all 474 GFL genes was obtained. The fiber lengths of the plants predicted with expressions of the 125 key GFL genes were significantly correlated with those predicted with the SNPs/InDels of the above 226 SNP/InDel-containing GFL genes (r = 0.892, P = 0.000). The prediction accuracies of fiber length using both genic datasets were highly consistent across environments and generations. Finally, we found that a training population consisting of 100-120 plants was sufficient to train a model for accurate prediction of a quantitative trait using the genes controlling the trait. Therefore, the genes controlling a quantitative trait are capable of accurately predicting its phenotype, thereby dramatically improving the ability, accuracy, and efficiency of phenotype prediction and promoting gene-based breeding in cotton and other species.
Affiliation(s)
- Yun-Hua Liu
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
| | - Yang Xu
- Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
| | - Meiping Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
| | - Yanru Cui
- Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
| | - Sing-Hoi Sze
- Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, United States
| | - C. Wayne Smith
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
| | - Shizhong Xu
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
- *Correspondence: Shizhong Xu,
| | - Hong-Bin Zhang
- Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
- Hong-Bin Zhang,
|
25
|
Carraro M, Monzon AM, Chiricosta L, Reggiani F, Aspromonte MC, Bellini M, Pagel K, Jiang Y, Radivojac P, Kundu K, Pal LR, Yin Y, Limongelli I, Andreoletti G, Moult J, Wilson SJ, Katsonis P, Lichtarge O, Chen J, Wang Y, Hu Z, Brenner SE, Ferrari C, Murgia A, Tosatto SC, Leonardi E. Assessment of patient clinical descriptions and pathogenic variants from gene panel sequences in the CAGI-5 intellectual disability challenge. Hum Mutat 2019; 40:1330-1345. [PMID: 31144778 PMCID: PMC7341177 DOI: 10.1002/humu.23823] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/18/2019] [Revised: 05/07/2019] [Accepted: 05/27/2019] [Indexed: 12/15/2022]
Abstract
The Critical Assessment of Genome Interpretation-5 intellectual disability challenge asked participants to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e. ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) were made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype was predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Also, causative or potentially pathogenic variants were predicted correctly by at least one group. However, predicting the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction was supported by rare or common variants in genes different from the causative one.
Affiliation(s)
- Marco Carraro
- Department of Biomedical Sciences, University of Padua, Padua, Italy
| | | | - Luigi Chiricosta
- Department of Biomedical Sciences, University of Padua, Padua, Italy
| | - Francesco Reggiani
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- Department of Information Engineering, University of Padua, Padua, Italy
| | | | - Mariagrazia Bellini
- Department of Woman and Child Health, University of Padua, Padua, Italy
- Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
| | - Kymberleigh Pagel
- Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
| | - Yuxiang Jiang
- Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
| | - Predrag Radivojac
- Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
| | - Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
| | - Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
| | - Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
| | | | - Gaia Andreoletti
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
| | - John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
| | - Stephen J. Wilson
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
| | - Panagiotis Katsonis
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
| | - Olivier Lichtarge
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
| | - Jingqi Chen
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Yaqiong Wang
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Carlo Ferrari
- Department of Information Engineering, University of Padua, Padua, Italy
| | - Alessandra Murgia
- Department of Woman and Child Health, University of Padua, Padua, Italy
- Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
| | - Silvio C.E. Tosatto
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- CNR Institute of Neuroscience, Padua, Italy
| | - Emanuela Leonardi
- Department of Woman and Child Health, University of Padua, Padua, Italy
- Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
|
26
|
Kasak L, Bakolitsa C, Hu Z, Yu C, Rine J, Dimster-Denk DF, Pandey G, Baets GD, Bromberg Y, Cao C, Capriotti E, Casadio R, Durme JV, Giollo M, Karchin R, Katsonis P, Leonardi E, Lichtarge O, Martelli PL, Masica D, Mooney SD, Olatubosun A, Pal LR, Radivojac P, Rousseau F, Savojardo C, Schymkowitz J, Thusberg J, Tosatto SC, Vihinen M, Väliaho J, Repo S, Moult J, Brenner SE, Friedberg I. Assessing computational predictions of the phenotypic effect of cystathionine-beta-synthase variants. Hum Mutat 2019; 40:1530-1545. [PMID: 31301157 PMCID: PMC7325732 DOI: 10.1002/humu.23868] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/11/2019] [Revised: 06/22/2019] [Accepted: 07/09/2019] [Indexed: 12/28/2022]
Abstract
Accurate prediction of the impact of genomic variation on phenotype is a major goal of computational biology and an important contributor to personalized medicine. Computational predictions can lead to a better understanding of the mechanisms underlying genetic diseases, including cancer, but their adoption requires thorough and unbiased assessment. Cystathionine-beta-synthase (CBS) is an enzyme that catalyzes the first step of the transsulfuration pathway, from homocysteine to cystathionine, and in which variations are associated with human hyperhomocysteinemia and homocystinuria. We have created a computational challenge under the CAGI framework to evaluate how well different methods can predict the phenotypic effect(s) of CBS single amino acid substitutions using a blinded experimental data set. CAGI participants were asked to predict yeast growth based on the identity of the mutations. The performance of the methods was evaluated using several metrics. The CBS challenge highlighted the difficulty of predicting the phenotype of an ex vivo system in a model organism when classification models were trained on human disease data. We also discuss the variations in difficulty of prediction for known benign and deleterious variants, as well as identify methodological and experimental constraints with lessons to be learned for future challenges.
Affiliation(s)
- Laura Kasak
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
- Institute of Biomedicine and Translational Medicine, University of Tartu, Tartu, Estonia
| | - Constantina Bakolitsa
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Changhua Yu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Jasper Rine
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
| | - Dago F. Dimster-Denk
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
| | - Gaurav Pandey
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Greet De Baets
- Switch Laboratory, VIB Center for Brain and Disease Research, Leuven, Belgium
- Department of Cellular and Molecular Medicine, KU Leuven, Leuven, Belgium
| | - Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA
| | - Chen Cao
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD, USA
| | - Emidio Capriotti
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Joost Van Durme
- Switch Laboratory, VIB Center for Brain and Disease Research, Leuven, Belgium
- Vrije Universiteit Brussel, Brussels, Belgium
| | - Manuel Giollo
- Department of Biomedical Sciences, University of Padua, Padua, Italy
| | - Rachel Karchin
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | | | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - David Masica
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD, USA
| | | | - Ayodeji Olatubosun
- Institute of Medical Technology, University of Tampere, Tampere, Finland
| | - Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
| | - Predrag Radivojac
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA
| | - Frederic Rousseau
- Switch Laboratory, VIB Center for Brain and Disease Research, Leuven, Belgium
- Department of Cellular and Molecular Medicine, KU Leuven, Leuven, Belgium
| | - Castrense Savojardo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Joost Schymkowitz
- Switch Laboratory, VIB Center for Brain and Disease Research, Leuven, Belgium
- Department of Cellular and Molecular Medicine, KU Leuven, Leuven, Belgium
| | | | | | - Mauno Vihinen
- Institute of Medical Technology, University of Tampere, Tampere, Finland
| | - Jouni Väliaho
- Institute of Medical Technology, University of Tampere, Tampere, Finland
| | - Susanna Repo
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - John Moult
- Department of Cellular and Molecular Medicine, KU Leuven, Leuven, Belgium
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, USA
| | - Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, OH, USA
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA USA
|
27
|
Kasak L, Hunter JM, Udani R, Bakolitsa C, Hu Z, Adhikari AN, Babbi G, Casadio R, Gough J, Guerrero RF, Jiang Y, Joseph T, Katsonis P, Kotte S, Kundu K, Lichtarge O, Martelli PL, Mooney SD, Moult J, Pal LR, Poitras J, Radivojac P, Rao A, Sivadasan N, Sunderam U, VG S, Yin Y, Zaucha J, Brenner SE, Meyn MS. CAGI SickKids challenges: Assessment of phenotype and variant predictions derived from clinical and genomic data of children with undiagnosed diseases. Hum Mutat 2019; 40:1373-1391. [PMID: 31322791 PMCID: PMC7318886 DOI: 10.1002/humu.23874] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 06/29/2019] [Revised: 07/15/2019] [Accepted: 07/15/2019] [Indexed: 01/02/2023]
Abstract
Whole-genome sequencing (WGS) holds great potential as a diagnostic test. However, the majority of patients currently undergoing WGS lack a molecular diagnosis, largely due to the vast number of undiscovered disease genes and our inability to assess the pathogenicity of most genomic variants. The CAGI SickKids challenges attempted to address this knowledge gap by assessing state-of-the-art methods for clinical phenotype prediction from genomes. CAGI4 and CAGI5 participants were provided with WGS data and clinical descriptions of 25 and 24 undiagnosed patients from the SickKids Genome Clinic Project, respectively. Predictors were asked to identify primary and secondary causal variants. In addition, for CAGI5, groups had to match each genome to one of three disorder categories (neurologic, ophthalmologic, and connective), and separately to each patient. The performance of matching genomes to categories was no better than random but two groups performed significantly better than chance in matching genomes to patients. Two of the ten variants proposed by two groups in CAGI4 were deemed to be diagnostic, and several proposed pathogenic variants in CAGI5 are good candidates for phenotype expansion. We discuss implications for improving in silico assessment of genomic variants and identifying new disease genes.
Affiliation(s)
- Laura Kasak
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
- Institute of Biomedicine and Translational Medicine, University of Tartu, Tartu, Estonia
| | - Jesse M. Hunter
- Department of Pediatrics and Wisconsin State Lab of Hygiene, University of Wisconsin Madison, WI, USA
| | - Rupa Udani
- Department of Pediatrics and Wisconsin State Lab of Hygiene, University of Wisconsin Madison, WI, USA
| | - Constantina Bakolitsa
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Aashish N. Adhikari
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Giulia Babbi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Julian Gough
- Department of Computer Science, University of Bristol, Bristol, UK
| | | | - Yuxiang Jiang
- Department of Computer Science, Indiana University, IN, USA
| | | | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | | | - Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD, USA
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Department of Biochemistry & Molecular Biology, Department of Pharmacology, Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, TX, USA
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Sean D. Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, WA, USA
| | - John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
- Department of Cell Biology and Molecular Genetics, University of Maryland, MD, USA
| | - Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, MA, USA
| | | | | | | | | | - Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD, USA
| | - Jan Zaucha
- Department of Computer Science, University of Bristol, Bristol, UK
| | - Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - M. Stephen Meyn
- Center for Human Genomics and Precision Medicine, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
- Department of Paediatrics, The Hospital for Sick Children, Toronto, Canada
|
28
|
McInnes G, Daneshjou R, Katsonis P, Lichtarge O, Srinivasan R, Rana S, Radivojac P, Mooney SD, Pagel KA, Stamboulian M, Jiang Y, Capriotti E, Wang Y, Bromberg Y, Bovo S, Savojardo C, Martelli PL, Casadio R, Pal LR, Moult J, Brenner SE, Altman R. Predicting venous thromboembolism risk from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges. Hum Mutat 2019; 40:1314-1320. [PMID: 31140652 DOI: 10.1002/humu.23825] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/31/2019] [Revised: 05/07/2019] [Accepted: 05/27/2019] [Indexed: 01/14/2023]
Abstract
Genetics plays a key role in venous thromboembolism (VTE) risk; however, established risk factors in European populations do not translate to individuals of African descent because of differences in allele frequencies between populations. As part of the fifth iteration of the Critical Assessment of Genome Interpretation, participants were asked to predict VTE status in exome data from African American subjects. Participants were provided with 103 unlabeled exomes from patients treated with warfarin for non-VTE causes or VTE and asked to predict which disease each subject had been treated for. Given the lack of training data, many participants opted to use unsupervised machine learning methods, clustering the exomes by variation in genes known to be associated with VTE. The best performing method using only VTE-related genes achieved an area under the ROC curve of 0.65. Here, we discuss the range of methods used in the prediction of VTE from sequence data and explore some of the difficulties of conducting a challenge with known confounders. In addition, we show that an existing genetic risk score for VTE that was developed in European subjects works well in African Americans.
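For readers unfamiliar with the metric quoted above, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name and the toy scores are hypothetical, and this is not the challenge's assessment code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted risk per subject (higher means more likely a case).
    labels: 1 for cases (e.g. VTE), 0 for controls.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(controls))


# Hypothetical toy predictions for six subjects.
toy_scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
toy_labels = [1, 1, 0, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported value of 0.65 in context.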
Affiliation(s)
- Gregory McInnes
- Biomedical Informatics Training Program, Stanford University, Stanford, California
| | - Roxana Daneshjou
- Department of Dermatology, Stanford School of Medicine, Stanford, California
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas; Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas; Department of Pharmacology, Baylor College of Medicine, Houston, Texas; Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas
- Sadhna Rana
- Innovations Labs, Tata Consultancy Services, Hyderabad, India
- Predrag Radivojac
- Khoury College of Computer and Information Sciences, Northeastern University, Boston, Massachusetts
- Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington
- Kymberleigh A Pagel
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Moses Stamboulian
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Yuxiang Jiang
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Emidio Capriotti
- BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
- Yanran Wang
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey
- Samuele Bovo
- Department of Pharmacy and Biotechnology, Bologna Biocomputing Group, University of Bologna, Italy
- Castrense Savojardo
- Department of Pharmacy and Biotechnology, Bologna Biocomputing Group, University of Bologna, Italy
- Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, Bologna Biocomputing Group, University of Bologna, Italy
- Rita Casadio
- Department of Pharmacy and Biotechnology, Bologna Biocomputing Group, University of Bologna, Italy; Institute of Biomembrane and Bioenergetics, Consiglio Nazionale delle Ricerche, Bari, Italy
- Lipika R Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland
- Steven E Brenner
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California
- Russ Altman
- Departments of Bioengineering, Biomedical Data Science, Genetics, and Medicine, Stanford University, Stanford, California
|
29
|
Abstract
Gene expression profiles potentially hold valuable information for the prediction of breeding values and phenotypes. In this study, the utility of transcriptome data for phenotype prediction was tested with 185 inbred lines of Drosophila melanogaster for nine traits in two sexes. We incorporated the transcriptome data into genomic prediction via two methods, GTBLUP and GRBLUP, both of which combine single nucleotide polymorphisms (SNPs) and transcriptome data. The genotypic data were used to construct the common additive genomic relationship matrix, which was used either alone in genomic best linear unbiased prediction (GBLUP) or jointly in a linear mixed model with a transcriptome-based linear kernel (GTBLUP) or a transcriptome-based Gaussian kernel (GRBLUP). We studied the predictive ability of the models and discuss a concept of "omics-augmented broad sense heritability" for the multi-omics era. For most traits, GRBLUP and GBLUP provided similar predictive abilities, but GRBLUP explained more of the phenotypic variance. For only one trait (olfactory perception of ethyl butyrate in females) was the predictive ability of GRBLUP (0.23) significantly higher than that of GBLUP (0.21). Our results suggest that accounting for transcriptome data has the potential to improve genomic predictions if transcriptome data can be included on a larger scale.
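The difference between the two transcriptome kernels named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the column standardization, and the bandwidth scaling by the number of genes are all assumptions chosen for illustration, and the input is assumed to be a lines-by-genes expression matrix with no constant columns.

```python
import numpy as np


def standardize(T):
    """Center and scale each gene (column) of a lines-by-genes matrix."""
    return (T - T.mean(axis=0)) / T.std(axis=0)


def linear_kernel(T):
    """Hypothetical GTBLUP-style kernel: Z Z^T / p on standardized expression."""
    Z = standardize(T)
    return Z @ Z.T / Z.shape[1]


def gaussian_kernel(T, bandwidth=1.0):
    """Hypothetical GRBLUP-style kernel: exp(-bandwidth * d^2 / p).

    d^2 is the squared Euclidean distance between two lines in
    standardized expression space; dividing by p is an assumed scaling.
    """
    Z = standardize(T)
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)
    return np.exp(-bandwidth * d2 / Z.shape[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.normal(size=(6, 20))  # toy data: 6 lines, 20 genes
    print(linear_kernel(T).shape, gaussian_kernel(T).shape)
```

In a mixed model, either matrix would serve as the covariance structure of a transcriptome random effect alongside the SNP-based relationship matrix. The linear kernel captures additive similarity in expression, while the Gaussian kernel can also capture nonlinear similarity.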
Affiliation(s)
- Zhengcao Li
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Göttingen, Göttingen, Germany
- Ning Gao
- State Key Laboratory of Biocontrol, Guangzhou Higher Education Mega Center, School of Life Science, Sun Yat-sen University, Guangzhou, China
- Henner Simianer
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Göttingen, Göttingen, Germany
|
30
|
Kim OD, Rocha M, Maia P. A Review of Dynamic Modeling Approaches and Their Application in Computational Strain Optimization for Metabolic Engineering. Front Microbiol 2018; 9:1690. [PMID: 30108559 PMCID: PMC6079213 DOI: 10.3389/fmicb.2018.01690] [Received: 03/27/2018] [Accepted: 07/06/2018] [Indexed: 12/03/2022]
Abstract
Mathematical modeling is a key process to describe the behavior of biological networks. One of the most difficult challenges is to build models that allow quantitative predictions of the cells' states over time. Recently, this issue has started to be tackled through novel in silico approaches, such as the reconstruction of dynamic models, the use of phenotype prediction methods, and pathway design via efficient strain optimization algorithms. The use of dynamic models, which include detailed kinetic information about the biological systems, potentially increases the scope of the applications and the accuracy of the phenotype predictions. New efforts in metabolic engineering aim at bridging the gap between this approach and other paradigms of mathematical modeling, such as constraint-based approaches. These strategies take advantage of the best features of each method and deal with the most notable limitation, the lack of available experimental information, which affects the accuracy and feasibility of solutions. Parameter estimation helps to solve this problem, but adds computational cost to the overall process. Moreover, the existing approaches have limitations in scalability, flexibility, and convergence time of the simulations, among others. The aim is to establish a trade-off between the size of the model and the level of accuracy of the solutions. In this work, we review the state of the art of dynamic modeling and related methods used for metabolic engineering applications, including approaches based on hybrid modeling. We describe approaches developed to address issues regarding the mathematical formulation and the underlying optimization algorithms, and that address phenotype prediction by including available kinetic rate laws of metabolic processes. Then, we discuss how these have been used and combined as the basis for computational strain optimization methods for metabolic engineering purposes and how they lead to bi-level schemes that can be used in industry, including a consideration of their limitations.
Affiliation(s)
- Osvaldo D Kim
- SilicoLife Lda, Braga, Portugal; Centre of Biological Engineering, Universidade do Minho, Braga, Portugal
- Miguel Rocha
- Centre of Biological Engineering, Universidade do Minho, Braga, Portugal
|
31
|
Robertson J, Yoshida C, Kruczkiewicz P, Nadon C, Nichani A, Taboada EN, Nash JHE. Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico Typing Resource (SISTR). Microb Genom 2018; 4:e000151. [PMID: 29338812 PMCID: PMC5857378 DOI: 10.1099/mgen.0.000151] [Received: 10/16/2017] [Accepted: 12/19/2017] [Indexed: 12/16/2022]
Abstract
Public health and food safety institutions around the world are adopting whole genome sequencing (WGS) to replace conventional methods for characterizing Salmonella for use in surveillance and outbreak response. Falling costs and increased throughput of WGS have resulted in an explosion of data, but questions remain as to the reliability and robustness of the data. Due to the critical importance of serovar information to public health, it is essential to have reliable serovar assignments available for all of the Salmonella records. The current study used a systematic assessment and curation of all Salmonella in the sequence read archive (SRA) to assess the state of the data and their utility. A total of 67 758 genomes were assembled de novo and quality-assessed for their assembly metrics as well as species and serovar assignments. A total of 42 400 genomes passed all of the quality criteria but 30.16 % of genomes were deposited without serotype information. These data were used to compare the concordance of reported and predicted serovars for two in silico prediction tools, multi-locus sequence typing (MLST) and the Salmonella in silico Typing Resource (SISTR), which produced predictions that were fully concordant with 87.51 and 91.91 % of the tested isolates, respectively. Concordance of in silico predictions increased when serovar variants were grouped together, 89.25 % for MLST and 94.98 % for SISTR. This study represents the first large-scale validation of serovar information in public genomes and provides a large validated set of genomes, which can be used to benchmark new bioinformatics tools.
Affiliation(s)
- James Robertson
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- Catherine Yoshida
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Peter Kruczkiewicz
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Celine Nadon
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Anil Nichani
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- Eduardo N. Taboada
- National Microbiology Laboratory, Public Health Agency of Canada, Lethbridge, AB, Canada
|
32
|
Lippert C, Sabatini R, Maher MC, Kang EY, Lee S, Arikan O, Harley A, Bernal A, Garst P, Lavrenko V, Yocum K, Wong T, Zhu M, Yang WY, Chang C, Lu T, Lee CWH, Hicks B, Ramakrishnan S, Tang H, Xie C, Piper J, Brewerton S, Turpaz Y, Telenti A, Roby RK, Och FJ, Venter JC. Identification of individuals by trait prediction using whole-genome sequencing data. Proc Natl Acad Sci U S A 2017; 114:10166-10171. [PMID: 28874526 PMCID: PMC5617305 DOI: 10.1073/pnas.1711125114] [Indexed: 11/18/2022]
Abstract
Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
Affiliation(s)
- Okan Arikan
- Human Longevity, Inc., Mountain View, CA 94303
- Axel Bernal
- Human Longevity, Inc., Mountain View, CA 94303
- Peter Garst
- Human Longevity, Inc., Mountain View, CA 94303
- Ken Yocum
- Human Longevity, Inc., Mountain View, CA 94303
- Mingfu Zhu
- Human Longevity, Inc., Mountain View, CA 94303
- Chris Chang
- Human Longevity, Inc., Mountain View, CA 94303
- Tim Lu
- Human Longevity, Inc., San Diego, CA 92121
- Barry Hicks
- Human Longevity, Inc., Mountain View, CA 94303
- Haibao Tang
- Human Longevity, Inc., Mountain View, CA 94303
- Chao Xie
- Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Jason Piper
- Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Yaron Turpaz
- Human Longevity, Inc., San Diego, CA 92121; Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Rhonda K Roby
- Human Longevity, Inc., San Diego, CA 92121; J. Craig Venter Institute, La Jolla, CA 92037
- Franz J Och
- Human Longevity, Inc., Mountain View, CA 94303
- J Craig Venter
- Human Longevity, Inc., San Diego, CA 92121; J. Craig Venter Institute, La Jolla, CA 92037
|
33
|
Ray B, Liu W, Fenyö D. Adaptive Multiview Nonnegative Matrix Factorization Algorithm for Integration of Multimodal Biomedical Data. Cancer Inform 2017; 16:1176935117725727. [PMID: 28835735 PMCID: PMC5564898 DOI: 10.1177/1176935117725727] [Received: 03/23/2017] [Accepted: 07/08/2017] [Indexed: 11/16/2022]
Abstract
The amounts and types of available multimodal tumor data are rapidly increasing, and their integration is critical for fully understanding the underlying cancer biology and personalizing treatment. However, the development of methods for effectively integrating multimodal data in a principled manner is lagging behind our ability to generate the data. In this article, we introduce an extension to a multiview nonnegative matrix factorization (NNMF) algorithm for dimensionality reduction and integration of heterogeneous data types and compare the predictive modeling performance of the method on unimodal and multimodal data. We also present a comparative evaluation of our novel multiview approach and current data integration methods. Our work provides an efficient method to extend an existing dimensionality reduction method. We report rigorous evaluation of the method on large-scale quantitative protein and phosphoprotein tumor data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) acquired using state-of-the-art liquid chromatography mass spectrometry. Exome sequencing and RNA-Seq data were also available from The Cancer Genome Atlas for the same tumors. For unimodal data, in the case of breast cancer, transcript levels were most predictive of estrogen and progesterone receptor status, and copy number variation was most predictive of human epidermal growth factor receptor 2 status. For ovarian and colon cancers, phosphoprotein and protein levels were most predictive of tumor grade and stage and residual tumor, respectively. When multiview NNMF was applied to multimodal data to predict outcomes, the improvement in performance over unimodal data was not statistically significant overall, suggesting that proteomics data may contain more predictive information regarding tumor phenotypes than transcript levels, probably because proteins are the functional gene products and therefore a more direct measurement of the functional state of the tumor. Here, we have applied our proposed approach to multimodal molecular data for tumors, but it is generally applicable to dimensionality reduction and joint analysis of any type of multimodal data.
Affiliation(s)
- Bisakha Ray
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
- Wenke Liu
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
- David Fenyö
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
|
34
|
Daneshjou R, Wang Y, Bromberg Y, Bovo S, Martelli PL, Babbi G, Lena PD, Casadio R, Edwards M, Gifford D, Jones DT, Sundaram L, Bhat RR, Li X, Pal LR, Kundu K, Yin Y, Moult J, Jiang Y, Pejaver V, Pagel KA, Li B, Mooney SD, Radivojac P, Shah S, Carraro M, Gasparini A, Leonardi E, Giollo M, Ferrari C, Tosatto SCE, Bachar E, Azaria JR, Ofran Y, Unger R, Niroula A, Vihinen M, Chang B, Wang MH, Franke A, Petersen BS, Pirooznia M, Zandi P, McCombie R, Potash JB, Altman RB, Klein TE, Hoskins RA, Repo S, Brenner SE, Morgan AA. Working toward precision medicine: Predicting phenotypes from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges. Hum Mutat 2017. [PMID: 28634997 DOI: 10.1002/humu.23280] [Indexed: 12/25/2022]
Abstract
Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.
Affiliation(s)
- Roxana Daneshjou
- Department of Genetics, Stanford School of Medicine, Stanford, California
- Yanran Wang
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey
- Samuele Bovo
- Biocomputing Group, BiGeA/CIG, "Luigi Galvani" Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics, and Biocomplexity, University of Bologna, Bologna, Italy
- Pier L Martelli
- Biocomputing Group, BiGeA/CIG, "Luigi Galvani" Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics, and Biocomplexity, University of Bologna, Bologna, Italy
- Giulia Babbi
- Biocomputing Group, BiGeA/CIG, "Luigi Galvani" Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics, and Biocomplexity, University of Bologna, Bologna, Italy
- Pietro Di Lena
- Biocomputing Group/Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
- Rita Casadio
- Biocomputing Group, BiGeA/CIG, "Luigi Galvani" Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics, and Biocomplexity, University of Bologna, Bologna, Italy; "Giorgio Prodi" Interdepartmental Center for Cancer Research, University of Bologna, Bologna, Italy
- Matthew Edwards
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
- David Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
- David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Laksshman Sundaram
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, Florida
- Rajendra Rana Bhat
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, Florida
- Xiaolin Li
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, Florida
- Lipika R Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
- Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland
- Yuxiang Jiang
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Vikas Pejaver
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana; Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington
- Kymberleigh A Pagel
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Biao Li
- Gilead Sciences, Foster City, California
- Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington
- Predrag Radivojac
- Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana
- Sohela Shah
- Qiagen Bioinformatics, Redwood City, California
- Marco Carraro
- Department of Biomedical Science, University of Padova, Padova, Italy
- Alessandra Gasparini
- Department of Biomedical Science, University of Padova, Padova, Italy; Department of Woman and Child Health, University of Padova, Padova, Italy
- Emanuela Leonardi
- Department of Woman and Child Health, University of Padova, Padova, Italy
- Manuel Giollo
- Department of Biomedical Science, University of Padova, Padova, Italy; Department of Information Engineering, University of Padova, Padova, Italy
- Carlo Ferrari
- Department of Information Engineering, University of Padova, Padova, Italy
- Silvio C E Tosatto
- Department of Biomedical Science, University of Padova, Padova, Italy; CNR Institute of Neuroscience, Padova, Italy
- Eran Bachar
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
- Johnathan R Azaria
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
- Yanay Ofran
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
- Ron Unger
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
- Abhishek Niroula
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Mauno Vihinen
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Billy Chang
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
- Maggie H Wang
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, Chinese University of Hong Kong, Shatin, N.T., Hong Kong; CUHK Shenzhen Research Institute, Shenzhen, China
- Andre Franke
- Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Kiel, Germany
- Britt-Sabina Petersen
- Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Kiel, Germany
- Mehdi Pirooznia
- Department of Psychiatry, The Johns Hopkins University School of Medicine, Baltimore, Maryland
- Peter Zandi
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
- James B Potash
- Department of Psychiatry, University of Iowa, Iowa City, Iowa
- Russ B Altman
- Department of Genetics, Stanford School of Medicine, Stanford, California
- Teri E Klein
- Department of Genetics, Stanford School of Medicine, Stanford, California
- Roger A Hoskins
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California
- Susanna Repo
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California
- Steven E Brenner
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California
|
35
|
Chandonia JM, Adhikari A, Carraro M, Chhibber A, Cutting GR, Fu Y, Gasparini A, Jones DT, Kramer A, Kundu K, Lam HYK, Leonardi E, Moult J, Pal LR, Searls DB, Shah S, Sunyaev S, Tosatto SCE, Yin Y, Buckley BA. Lessons from the CAGI-4 Hopkins clinical panel challenge. Hum Mutat 2017; 38:1155-1168. [PMID: 28397312 DOI: 10.1002/humu.23225] [Received: 01/16/2017] [Revised: 03/24/2017] [Accepted: 03/29/2017] [Indexed: 12/17/2022]
Abstract
The CAGI-4 Hopkins clinical panel challenge was an attempt to assess state-of-the-art methods for clinical phenotype prediction from DNA sequence. Participants were provided with exonic sequences of 83 genes for 106 patients from the Johns Hopkins DNA Diagnostic Laboratory. Five groups participated in the challenge, predicting both the probability that each patient had each of the 14 possible classes of disease, as well as one or more causal variants. In cases where the Hopkins laboratory reported a variant, at least one predictor correctly identified the disease class in 36 of the 43 patients (84%). Even in cases where the Hopkins laboratory did not find a variant, at least one predictor correctly identified the class in 39 of the 63 patients (62%). Each prediction group correctly diagnosed at least one patient that was not successfully diagnosed by any other group. We discuss the causal variant predictions by different groups and their implications for further development of methods to assess variants of unknown significance. Our results suggest that clinically relevant variants may be missed when physicians order small panels targeted on a specific phenotype. We also quantify the false-positive rate of DNA-guided analysis in the absence of prior phenotypic indication.
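The two success rates quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the helper name `percent` is hypothetical, and the pooled figure is derived here rather than reported in the abstract.

```python
def percent(hits, total):
    """Return hits/total as a percentage rounded to the nearest integer."""
    return round(100 * hits / total)


# Patients for whom the laboratory reported a variant.
with_variant = percent(36, 43)       # 84

# Patients for whom the laboratory did not find a variant.
without_variant = percent(39, 63)    # 62

# The two groups together account for all 106 patients.
total_patients = 43 + 63             # 106

# Pooled rate across both groups (derived, not stated in the abstract).
pooled = percent(36 + 39, total_patients)  # 71

print(with_variant, without_variant, total_patients, pooled)
```

The group sizes sum to the 106 patients mentioned in the abstract, so the two percentages cover the full cohort.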
Affiliation(s)
- John-Marc Chandonia
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California
- Aashish Adhikari
- Department of Plant and Microbial Biology, University of California, Berkeley, California
- Marco Carraro
- Department of Biomedical Sciences, University of Padova, Padova, Italy
- Garry R Cutting
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Yao Fu
- Roche Sequencing Solutions, Belmont, California
- Alessandra Gasparini
- Department of Biomedical Sciences, University of Padova, Padova, Italy; Department of Women's and Children's Health, University of Padova, Padova, Italy
- David T Jones
- Department of Computer Science, University College London, London, United Kingdom
- Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
- Emanuela Leonardi
- Department of Women's and Children's Health, University of Padova, Padova, Italy
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland
- Lipika R Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- Sohela Shah
- Qiagen Bioinformatics, Redwood City, California
- Shamil Sunyaev
- Division of Genetics, Department of Medicine, Brigham & Women's Hospital, Harvard Medical School, Boston, Massachusetts; Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padova, Italy; CNR Institute of Neuroscience, Padova, Italy
- Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
- Bethany A Buckley
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland
|
36
|
Yachison CA, Yoshida C, Robertson J, Nash JHE, Kruczkiewicz P, Taboada EN, Walker M, Reimer A, Christianson S, Nichani A, Nadon C. The Validation and Implications of Using Whole Genome Sequencing as a Replacement for Traditional Serotyping for a National Salmonella Reference Laboratory. Front Microbiol 2017. [PMID: 28649236 PMCID: PMC5465390 DOI: 10.3389/fmicb.2017.01044] [Indexed: 01/29/2023]
Abstract
Salmonella serotyping remains the gold-standard tool for the classification of Salmonella isolates and forms the basis of Canada's national surveillance program for this priority foodborne pathogen. Public health officials have been increasingly looking toward whole genome sequencing (WGS) to provide a large set of data from which all the relevant information about an isolate can be mined. However, rigorous validation and careful consideration of the potential implications of replacing traditional surveillance methodologies with WGS data analysis tools are needed. Two in silico tools for Salmonella serotyping have been developed, the Salmonella in silico Typing Resource (SISTR) and SeqSero, while seven-gene MLST for serovar prediction can be adapted for in silico analysis. All three analysis methods were assessed and compared to traditional serotyping techniques using a set of 813 verified clinical and laboratory isolates, including 492 Canadian clinical isolates and 321 isolates of human and non-human sources. Successful results were obtained for 94.8, 88.2, and 88.3% of the isolates tested using SISTR, SeqSero, and MLST, respectively, indicating that all would be suitable for maintaining the historical records, surveillance systems, and communication structures currently in place; the choice of platform will ultimately depend on the user's needs. Results also pointed to the need to reframe serotyping in the genomic era as a test to understand the genes that are carried by an isolate, one which is not necessarily congruent with what is antigenically expressed. The adoption of WGS for serotyping will provide the simultaneous collection of information that can be used by multiple programs within the current surveillance paradigm; however, this does not negate the importance of the various programs or the role of serotyping going forward.
Affiliation(s)
- Chris A Yachison
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada; Department of Medical Microbiology, University of Manitoba, Winnipeg, MB, Canada
- Catherine Yoshida
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- James Robertson
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- John H E Nash
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- Peter Kruczkiewicz
- National Microbiology Laboratory, Public Health Agency of Canada, Lethbridge, AB, Canada
- Eduardo N Taboada
- National Microbiology Laboratory, Public Health Agency of Canada, Lethbridge, AB, Canada
- Matthew Walker
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Aleisha Reimer
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Sara Christianson
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada
- Anil Nichani
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
- Celine Nadon
- National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada; Department of Medical Microbiology, University of Manitoba, Winnipeg, MB, Canada
|
37
|
Abstract
Most diseases, including those of genetic origin, express a continuum of severity. Clinical interventions for numerous diseases are based on the severity of the phenotype. Predicting severity due to genetic variants could facilitate diagnosis and choice of therapy. Although computational predictions have been used as evidence for classifying the disease relevance of genetic variants, dedicated tools for predicting disease severity on a large scale are missing. Here, we manually curated a dataset containing variants leading to severe and less severe phenotypes and studied the abilities of variation impact predictors to distinguish between them. We found that these tools cannot separate the two groups of variants. Then, we developed a novel machine-learning-based method, PON-PS (http://structure.bmc.lu.se/PON-PS), for the classification of amino acid substitutions associated with benign, severe, and less severe phenotypes. We tested the method using an independent test dataset and variants in four additional proteins. For distinguishing severe and nonsevere variants, PON-PS showed an accuracy of 61% in the test dataset, which is higher than for existing tolerance prediction methods. PON-PS is the first generic tool developed for this task. The tool can be used together with other evidence for improving diagnosis and prognosis and for prioritization of preventive interventions, clinical monitoring, and molecular tests.
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
|
38
|
Klein A, Mazor Y, Karban A, Ben-Itzhak O, Chowers Y, Sabo E. Early histological findings may predict the clinical phenotype in Crohn's colitis. United European Gastroenterol J 2016; 5:694-701. [PMID: 28815033 DOI: 10.1177/2050640616676435] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 06/11/2016] [Accepted: 10/03/2016] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND AND AIMS: Predicting the clinical course of Crohn's disease (CD) is relevant for treatment selection. Currently, such diagnostic tools are lacking. In a previous pilot study, morphometric tissue image analysis showed promise in predicting the clinical phenotype and need for surgery. In this study, we aimed to validate our previous results on a larger cohort. METHODS: Colonic biopsies from CD patients with colonic or ileocolonic disease and at least five years of post-biopsy clinical follow-up were analyzed. The results were used to predict post-biopsy clinical phenotypes and outcomes. Data analysis was performed using multivariate regression models, discriminant score (DS) computations and Neural Network (NNET). RESULTS: Multivariate analysis of morphometric variables differentiated between B1 and B2 phenotypes (sensitivity 81%, specificity 74%, accuracy on cross-validation 75%; area under the curve (AUC) 0.74 (CI 0.6-0.84); NNET model sensitivity 87%, specificity 67% on the testing population). Differentiation between B1 and B3 phenotypes was also possible (sensitivity 69%, specificity 76%, accuracy 70.5% on cross-validation; AUC 0.78 (CI 0.68-0.89); NNET model sensitivity 78%, specificity 77% on the testing population). Differentiating between B2 and B3 phenotypes was not possible using morphometric variables. Multivariate analysis predicted surgery (sensitivity 67%, specificity 72.5%, accuracy 69%; AUC 0.72 (CI 0.61-0.82); NNET model sensitivity 80%, specificity 91% on the testing population). CONCLUSIONS: This study validates previous results and suggests that morphometric image analysis of early biopsies from Crohn's colitis patients may contribute to the prediction of future outcomes such as clinical phenotype and surgery. Prospective validation on larger cohorts is still needed.
Affiliation(s)
- Amir Klein
- Department of Gastroenterology, Rambam Health Care Campus, Haifa, Israel
- Yoav Mazor
- Department of Gastroenterology, Rambam Health Care Campus, Haifa, Israel
- Amir Karban
- Department of Gastroenterology, Rambam Health Care Campus, Haifa, Israel
- Ofer Ben-Itzhak
- Department of Pathology, Rambam Health Care Campus, Haifa, Israel
- Yehuda Chowers
- Department of Gastroenterology, Rambam Health Care Campus, Haifa, Israel
- Edmond Sabo
- Department of Pathology, Rambam Health Care Campus, Haifa, Israel
|
39
|
Abstract
Genomics has been used with varying degrees of success in the context of drug discovery and in defining mechanisms of action for diseases like cancer and neurodegenerative and rare diseases in the quest for orphan drugs. To improve its utility, accuracy, and cost-effectiveness, optimization of analytical methods, especially those that translate to clinically relevant outcomes, is critical. Here we define a novel tool for genomic analysis, termed a biomedical robot, in order to improve phenotype prediction, the identification of disease pathogenesis, and the definition of therapeutic targets. Biomedical robot analytics differ from historical methods in that they are based on melding feature selection methods and ensemble learning techniques. The biomedical robot mathematically exploits the structure of the uncertainty space of any classification problem conceived as an ill-posed optimization problem. Given a classifier, there exist different equivalent small-scale genetic signatures that provide similar predictive accuracies. We perform a sensitivity analysis of the biomedical robot concept to noise, using synthetic microarrays perturbed by different kinds of noise in expression and class assignment. Finally, we show the application of this concept to the analysis of different diseases, inferring the pathways and the correlation networks. The final aim of a biomedical robot is to improve knowledge discovery and provide decision systems to optimize diagnosis, treatment, and prognosis. This analysis shows that biomedical robots are robust against different kinds of noise and particularly to wrong class assignment of the samples. Assessing the uncertainty that is inherent to any phenotype prediction problem is the right way to address this kind of problem.
|
40
|
Lopes MS, Bastiaansen JW, Janss L, Knol EF, Bovenhuis H. Estimation of Additive, Dominance, and Imprinting Genetic Variance Using Genomic Data. G3 (Bethesda) 2015; 5:2629-37. [PMID: 26438289 DOI: 10.1534/g3.115.019513] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022]
Abstract
Traditionally, exploration of genetic variance in humans, plants, and livestock species has been limited mostly to the use of additive effects estimated using pedigree data. However, with the development of dense panels of single-nucleotide polymorphisms (SNPs), the exploration of genetic variation of complex traits is moving from quantifying the resemblance between family members to the dissection of genetic variation at individual loci. With SNPs, we were able to quantify the contribution of additive, dominance, and imprinting variance to the total genetic variance by using a SNP regression method. The method was validated in simulated data and applied to three traits (number of teats, backfat, and lifetime daily gain) in three purebred pig populations. In simulated data, the estimates of additive, dominance, and imprinting variance were very close to the simulated values. In real data, dominance effects account for a substantial proportion of the total genetic variance (up to 44%) for these traits in these populations. The contribution of imprinting to the total phenotypic variance of the evaluated traits was relatively small (1–3%). Our results indicate a strong relationship between additive variance explained per chromosome and chromosome length, which has been described previously for other traits in other species. We also show that a similar linear relationship exists for dominance and imprinting variance. These novel results improve our understanding of the genetic architecture of the evaluated traits and show promise for applying the SNP regression method to other traits and species, including human diseases.
|
41
|
Porth I, Klápště J, Skyba O, Friedmann MC, Hannemann J, Ehlting J, El-Kassaby YA, Mansfield SD, Douglas CJ. Network analysis reveals the relationship among wood properties, gene expression levels and genotypes of natural Populus trichocarpa accessions. New Phytol 2013; 200:727-742. [PMID: 23889128 DOI: 10.1111/nph.12419] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 03/08/2013] [Accepted: 06/17/2013] [Indexed: 05/21/2023]
Abstract
High-throughput approaches have been widely applied to elucidate the genetic underpinnings of industrially important wood properties. Wood traits are polygenic in nature, but gene hierarchies can be assessed to identify the most important gene variants controlling specific traits within complex networks defining the overall wood phenotype. We tested a large set of genetic, genomic, and phenotypic information in an integrative approach to predict wood properties in Populus trichocarpa. Nine-yr-old natural P. trichocarpa trees including accessions with high contrasts in six traits related to wood chemistry and ultrastructure were profiled for gene expression on 49k Nimblegen (Roche NimbleGen Inc., Madison, WI, USA) array elements and for 28,831 polymorphic single nucleotide polymorphisms (SNPs). Pre-selected transcripts and SNPs with high statistical dependence on phenotypic traits were used in Bayesian network learning procedures with a stepwise K2 algorithm to infer phenotype-centric networks. Transcripts were pre-selected at a much lower logarithm of Bayes factor (logBF) threshold than SNPs and were not accommodated in the networks. Using persistent variables, we constructed cross-validated networks for variability in wood attributes, which contained four to six variables with 94-100% predictive accuracy. Accommodated gene variants revealed the hierarchy in the genetic architecture that underpins substantial phenotypic variability, and represent new tools to support the maximization of response to selection.
Affiliation(s)
- Ilga Porth
- Department of Wood Science, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Department of Botany, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Jaroslav Klápště
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Department of Dendrology and Forest Tree Breeding, Faculty of Forestry and Wood Sciences, Czech University of Life Sciences, Prague, 165 21, Czech Republic
- Oleksandr Skyba
- Department of Wood Science, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Michael C Friedmann
- Department of Botany, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Jan Hannemann
- Department of Biology and Centre for Forest Biology, University of Victoria, Victoria, BC, Canada, V8W 3N5
- Juergen Ehlting
- Department of Biology and Centre for Forest Biology, University of Victoria, Victoria, BC, Canada, V8W 3N5
- Yousry A El-Kassaby
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Shawn D Mansfield
- Department of Wood Science, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
- Carl J Douglas
- Department of Botany, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
|
42
|
Chang R, Shoemaker R, Wang W. A novel knowledge-driven systems biology approach for phenotype prediction upon genetic intervention. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1170-1182. [PMID: 21282866 PMCID: PMC3211072 DOI: 10.1109/tcbb.2011.18] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 05/30/2023]
Abstract
Deciphering the biological networks underlying complex phenotypic traits, e.g., human disease, is undoubtedly crucial to understanding the underlying molecular mechanisms and to developing effective therapeutics. Due to the network complexity and the relatively small number of available experiments, data-driven modeling is a great challenge for deducing the functions of genes/proteins in the network and in phenotype formation. We propose a novel knowledge-driven systems biology method that utilizes qualitative knowledge to construct a Dynamic Bayesian network (DBN) to represent the biological network underlying a specific phenotype. Edges in this network depict physical interactions between genes and/or proteins. A qualitative knowledge model first translates typical molecular interactions into constraints when resolving the DBN structure and parameters. Therefore, the uncertainty of the network is restricted to a subset of models which are consistent with the qualitative knowledge. All models satisfying the constraints are considered as candidates for the underlying network. These consistent models are used to perform quantitative inference. By in silico inference, we can predict phenotypic traits upon genetic interventions and perturbations in the network. We applied our method to analyze the puzzling mechanism of the breast cancer cell proliferation network and accurately predicted cancer cell growth rate upon manipulating (anti)cancerous marker genes/proteins.
|
43
|
Wang PI, Marcotte EM. It's the machine that matters: Predicting gene function and phenotype from protein networks. J Proteomics 2010; 73:2277-89. [PMID: 20637909 PMCID: PMC2953423 DOI: 10.1016/j.jprot.2010.07.005] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.3] [Received: 03/22/2010] [Revised: 06/22/2010] [Accepted: 07/07/2010] [Indexed: 12/17/2022]
Abstract
Increasing knowledge about the organization of proteins into complexes, systems, and pathways has led to a flowering of theoretical approaches for exploiting this knowledge in order to better learn the functions of proteins and their roles underlying phenotypic traits and diseases. Much of this body of theory has been developed and tested in model organisms, relying on their relative simplicity and genetic and biochemical tractability to accelerate the research. In this review, we discuss several of the major approaches for computationally integrating proteomics and genomics observations into integrated protein networks, then applying guilt-by-association in these networks in order to identify genes underlying traits. Recent trends in this field include a rising appreciation of the modular network organization of proteins underlying traits or mutational phenotypes, and how to exploit such protein modularity using computational approaches related to the internet search algorithm PageRank. Many protein network-based predictions have recently been experimentally confirmed in yeast, worms, plants, and mice, and several successful approaches in model organisms have been directly translated to analyze human disease, with notable recent applications to glioma and breast cancer prognosis.
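The PageRank-style guilt-by-association idea mentioned in this abstract can be illustrated as a random walk with restart from known trait genes. This is only a minimal sketch and not the reviewed authors' method: the toy network, the gene names, the restart probability of 0.3, and the function name `random_walk_with_restart` are all assumptions made for the example.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, n_iter=200):
    """Score nodes by network proximity to seed genes (personalized PageRank)."""
    # Build an undirected adjacency list from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    # Restart distribution: uniform over the seed genes (assumed to be in the network).
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(n_iter):
        # Each step: jump back to a seed with probability `restart`,
        # otherwise move to a uniformly chosen neighbor.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * scores[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        scores = nxt
    return scores


# Hypothetical toy network: a chain of interacting genes, with geneA as the known trait gene.
toy_edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")]
toy_scores = random_walk_with_restart(toy_edges, seeds={"geneA"})
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy chain, genes closer to the seed receive higher scores, which is the sense in which network neighbors are treated as candidate trait genes.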
Affiliation(s)
- Peggy I Wang
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX 78712-1064, USA.
|
44
|
Lamers SL, Salemi M, McGrath MS, Fogel GB. Prediction of R5, X4, and R5X4 HIV-1 coreceptor usage with evolved neural networks. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:291-300. [PMID: 18451438 PMCID: PMC3523352 DOI: 10.1109/tcbb.2007.1074] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 05/26/2023]
Abstract
The HIV-1 genome is highly heterogeneous. This variation affords the virus a wide range of molecular properties, including the ability to infect cell types, such as macrophages and lymphocytes, expressing different chemokine receptors on the cell surface. In particular, R5 HIV-1 viruses use CCR5 as co-receptor for viral entry, X4 viruses use CXCR4, whereas some viral strains, known as R5X4 or D-tropic, have the ability to utilize both co-receptors. X4 and R5X4 viruses are associated with rapid disease progression to AIDS. R5X4 viruses differ in that they have yet to be characterized by the examination of the genetic sequence of HIV-1 alone. In this study, a series of experiments was performed to evaluate different strategies of feature selection and neural network optimization. We demonstrate the use of artificial neural networks trained via evolutionary computation to predict viral co-receptor usage. The results indicate identification of R5X4 viruses with predictive accuracy of 75.5%.
Affiliation(s)
- Marco Salemi
- Department of Pathology, Immunology, and Laboratory Medicine, University of Florida (UF-COM) Gainesville, 1600 S.W. Archer Road, Gainesville, FL 32610
- Michael S. McGrath
- Department of Medicine, University of California San Francisco, San Francisco, CA 94143-0874
- Gary B. Fogel
- Natural Selection, Inc., 9330 Scranton Rd., Suite 150, San Diego, CA 92121
|
45
|
Matthew R, Banjevic M, Chan AS, Myers L, Wolkowicz R, Haberer J, Singer J. Use of the l1 norm for selection of sparse parameter sets that accurately predict drug response phenotype from viral genetic sequences. AMIA Annu Symp Proc 2005; 2005:505-9. [PMID: 16779091 PMCID: PMC1560612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/10/2023]
Abstract
We describe the use of the l1 norm for selection of a sparse set of model parameters that are used in the prediction of viral drug response, based on genetic sequence data of the Human Immunodeficiency Virus (HIV) reverse-transcriptase enzyme. We discuss the use of the l1 norm in the Least Absolute Shrinkage and Selection Operator (LASSO) regression model and the Support Vector Machine model. When tested by cross-validation with laboratory measurements, these models predict viral phenotype, or resistance, in response to Reverse-Transcriptase Inhibitors (RTIs) more accurately than other known models. The l1 norm is the most selective convex function, which sets a large proportion of the parameters to zero and also assures that a single optimal solution will be found, given a particular model formulation and training data set. A statistical model that reliably predicts viral drug response is an important tool in the selection of Anti-Retroviral Therapy. These techniques have general application to modeling phenotype from complex genetic data.
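The way an l1 penalty sets parameters exactly to zero can be seen in a small LASSO sketch. The example below is illustrative only: the abstract does not state which solver was used, so cyclic coordinate descent is swapped in here, and the objective scaling (1/(2n) times the squared error plus `lam` times the l1 norm), the function names, and the toy data are all assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            # Closed-form one-dimensional update: soft-threshold, then rescale.
            w[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return w


# Hypothetical toy data: the response depends only on the first feature (y = 2 * x1);
# the second feature is small noise.
X_toy = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.05], [4.0, -0.05]]
y_toy = [2.0, 4.0, 6.0, 8.0]
weights = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

On this toy data the weight of the uninformative second feature is driven exactly to zero, while the first weight stays close to 2 with a small shrinkage toward zero.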
|