1
Huang S, Wu Z, Wang T, Yu R, Song Z, Wang H. MmisAT and MmisP: an efficient and accurate suite of variant analysis toolkit for primary mitochondrial diseases. Hum Genomics 2023; 17:108. [PMID: 38012712 PMCID: PMC10683248 DOI: 10.1186/s40246-023-00557-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2023] [Accepted: 11/22/2023] [Indexed: 11/29/2023]
Abstract
Recent advances in next-generation sequencing (NGS) technology have greatly accelerated the need for efficient annotation to accurately interpret clinically relevant genetic variants in human diseases. Therefore, it is crucial to develop appropriate analytical tools to improve the interpretation of disease variants. Given the unique genetic characteristics of mitochondria, including haplogroup, heteroplasmy, and maternal inheritance, we developed a suite of variant analysis toolkits specifically designed for primary mitochondrial diseases: the Mitochondrial Missense Variant Annotation Tool (MmisAT) and the Mitochondrial Missense Variant Pathogenicity Predictor (MmisP). MmisAT can handle protein-coding variants from both nuclear DNA and mtDNA and generate 349 annotation types across six categories. It processes 4.78 million variant data in 76 min, making it a valuable resource for clinical and research applications. Additionally, MmisP provides pathogenicity scores to predict the pathogenicity of genetic variations in mitochondrial disease. It has been validated using cross-validation and external datasets and demonstrated higher overall discriminant accuracy with a receiver operating characteristic (ROC) curve area under the curve (AUC) of 0.94, outperforming existing pathogenicity predictors. In conclusion, the MmisAT is an efficient tool that greatly facilitates the process of variant annotation, expanding the scope of variant annotation information. Furthermore, the development of MmisP provides valuable insights into the creation of disease-specific, phenotype-specific, and even gene-specific predictors of pathogenicity, further advancing our understanding of specific fields.
Affiliation(s)
- Shuangshuang Huang
- Department of Clinical Laboratory, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
- Zhaoyu Wu
- Department of Clinical Laboratory, The Affiliated Hospital of Guangdong Medical University, Zhanjiang, China
- Tong Wang
- Department of Clinical Laboratory, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
- Rui Yu
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
- Zhijian Song
- OrigiMed, 5th Floor, Building 3, No.115 Xin Jun Huan Road, Minhang District, Shanghai, China.
- Hao Wang
- Department of Clinical Laboratory, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China.
2
Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. Big Data 2022; 10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
The amount of available data is continuously growing. This phenomenon promotes a new concept, named big data. The highlight technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Affiliation(s)
- Gabriel Dall'Alba
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada
- Pedro Lenz Casa
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Fernanda Pessi de Abreu
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Daniel Luis Notari
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Scheila de Avila E Silva
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
3
Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021; 15:11779322211035921. [PMID: 34376975 PMCID: PMC8323418 DOI: 10.1177/11779322211035921] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/01/2021] [Accepted: 07/12/2021] [Indexed: 12/27/2022]
Abstract
High-throughput experiments enable researchers to explore complex multifactorial diseases through large-scale analysis of omics data. Challenges for such high-dimensional data sets include storage, analyses, and sharing. Recent innovations in computational technologies and approaches, especially in cloud computing, offer a promising, low-cost, and highly flexible solution in the bioinformatics domain. Cloud computing is rapidly proving increasingly useful in molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or proteomics data sets), and for the integration, analysis, and interpretation of phenotypic data. We review the adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and provide insights into state-of-the-art cloud bioinformatics applications.
Affiliation(s)
- Saraswati Koppad
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Annappa B
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK.,MRC Health Data Research UK (HDR UK), London, UK.,NIHR Experimental Cancer Medicine Centre, Birmingham, UK.,NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
- Animesh Acharjee
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
4
Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing technologies: An overview. Hum Immunol 2021; 82:801-811. [PMID: 33745759 DOI: 10.1016/j.humimm.2021.02.012] [Citation(s) in RCA: 223] [Impact Index Per Article: 74.3] [Received: 12/14/2020] [Revised: 02/18/2021] [Accepted: 02/23/2021] [Indexed: 12/14/2022]
Abstract
Since the days of Sanger sequencing, next-generation sequencing technologies have significantly evolved to provide increased data output, efficiencies, and applications. These next generations of technologies can be categorized based on read length. This review provides an overview of these technologies as two paradigms: short-read, or "second-generation," technologies, and long-read, or "third-generation," technologies. Herein, short-read sequencing approaches are represented by the most prevalent technologies, Illumina and Ion Torrent, and long-read sequencing approaches are represented by Pacific Biosciences and Oxford Nanopore technologies. All technologies are reviewed along with reported advantages and disadvantages. Until recently, short-read sequencing was thought to provide high accuracy limited by read-length, while long-read technologies afforded much longer read-lengths at the expense of accuracy. Emerging developments for third-generation technologies hold promise for the next wave of sequencing evolution, with the co-existence of longer read lengths and high accuracy.
Affiliation(s)
- Taishan Hu
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States
- Nilesh Chitnis
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States; Department of Surgery, Baylor College of Medicine, Houston, TX, United States
- Dimitri Monos
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
- Anh Dinh
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
5
Verma A, Halder A, Marathe S, Purwar R, Srivastava S. A proteogenomic approach to target neoantigens in solid tumors. Expert Rev Proteomics 2021; 17:797-812. [PMID: 33491499 DOI: 10.1080/14789450.2020.1881889] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Abstract
INTRODUCTION Proteogenomic techniques find applications in identifying novel cancer-specific peptides called neoantigens; they are non-self peptides derived from tumor-specific non-synonymous mutations. These peptides with MHCs are recognized by the T cells and induce an antitumor response. Due to their selective expression of tumor cells, neoantigens are considered attractive targets for cancer immunotherapy. AREAS COVERED In this review, we have discussed the proteogenomic strategies to identify neoantigens. We have also provided a neoantigen identification pipeline using data from whole-exome sequencing, RNA sequencing, and MHC peptidomics. Further, we have reviewed recent tools for neoantigen discovery. EXPERT COMMENTARY The limitations in instrument sensitivity and availability of bioinformatics tools have restricted the identification of neoantigens from tumor samples. Nonetheless, the recent improvement in genome sequencing, mass spectrometry technologies, and the development of reliable algorithms for epitope prediction provide hope for efficient identification of neoantigens. Translating this workflow on patient samples would represent a massive advancement in neoantigen identification methods, leading to the constitution of novel personalized neoantigen cancer vaccines.
Affiliation(s)
- Ayushi Verma
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Halder
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
- Soumitra Marathe
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
- Rahul Purwar
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
- Sanjeeva Srivastava
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
6
Tafazoli A, Wawrusiewicz-Kurylonek N, Posmyk R, Miltyk W. Pharmacogenomics, How to Deal with Different Types of Variants in Next Generation Sequencing Data in the Personalized Medicine Area. J Clin Med 2020; 10:jcm10010034. [PMID: 33374421 PMCID: PMC7796098 DOI: 10.3390/jcm10010034] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/30/2020] [Revised: 12/21/2020] [Accepted: 12/22/2020] [Indexed: 12/15/2022]
Abstract
Pharmacogenomics (PGx) is the knowledge of diverse drug responses and effects in people, based on their genomic profiles. Such information is considered as one of the main directions to reach personalized medicine in future clinical practices. Since the start of applying next generation sequencing (NGS) methods in drug related clinical investigations, many common medicines found their genetic data for the related metabolizing/shipping proteins in the human body. Yet, the employing of technology is accompanied by big obtained data, which most of them have no clear guidelines for consideration in routine treatment decisions for patients. This review article talks about different types of NGS derived PGx variants in clinical studies and try to display the current and newly developed approaches to deal with pharmacogenetic data with/without clear guidelines for considering in clinical settings.
Affiliation(s)
- Alireza Tafazoli
- Department of Analysis and Bioanalysis of Medicines, Faculty of Pharmacy with the Division of Laboratory Medicine, Medical University of Białystok, 15-089 Białystok, Poland;
- Clinical Research Centre, Medical University of Białystok, 15-276 Bialystok, Poland
- Renata Posmyk
- Department of Clinical Genetics, Medical University of Białystok, 15-089 Białystok, Poland; (N.W.-K.); (R.P.)
- Wojciech Miltyk
- Department of Analysis and Bioanalysis of Medicines, Faculty of Pharmacy with the Division of Laboratory Medicine, Medical University of Białystok, 15-089 Białystok, Poland;
- Correspondence: Tel.: +48-857485845
7
Pal LR, Kundu K, Yin Y, Moult J. Matching whole genomes to rare genetic disorders: Identification of potential causative variants using phenotype-weighted knowledge in the CAGI SickKids5 clinical genomes challenge. Hum Mutat 2020; 41:347-362. [PMID: 31680375 PMCID: PMC7182498 DOI: 10.1002/humu.23933] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/09/2019] [Revised: 09/26/2019] [Accepted: 10/13/2019] [Indexed: 02/06/2023]
Abstract
Precise identification of causative variants from whole-genome sequencing data, including both coding and noncoding variants, is challenging. The Critical Assessment of Genome Interpretation 5 SickKids clinical genome challenge provided an opportunity to assess our ability to extract such information. Participants in the challenge were required to match each of the 24 whole-genome sequences to the correct phenotypic profile and to identify the disease class of each genome. These are all rare disease cases that have resisted genetic diagnosis in a state-of-the-art pipeline. The patients have a range of eye, neurological, and connective-tissue disorders. We used a gene-centric approach to address this problem, assigning each gene a multiphenotype-matching score. Mutations in the top-scoring genes for each phenotype profile were ranked on a 6-point scale of pathogenicity probability, resulting in an approximately equal number of top-ranked coding and noncoding candidate variants overall. We were able to assign the correct disease class for 12 cases and the correct genome to a clinical profile for five cases. The challenge assessor found genes in three of these five cases as likely appropriate. In the postsubmission phase, after careful screening of the genes in the correct genome, we identified additional potential diagnostic variants, a high proportion of which are noncoding.
Affiliation(s)
- Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
- Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
8
Jiang Y, Wu C, Zhang Y, Zhang S, Yu S, Lei P, Lu Q, Xi Y, Wang H, Song Z. GTX.Digest.VCF: an online NGS data interpretation system based on intelligent gene ranking and large-scale text mining. BMC Med Genomics 2019; 12:193. [PMID: 31856831 PMCID: PMC6923899 DOI: 10.1186/s12920-019-0637-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/19/2019] [Accepted: 11/26/2019] [Indexed: 02/07/2023]
Abstract
Background An important task in the interpretation of sequencing data is to highlight pathogenic genes (or detrimental variants) in the field of Mendelian diseases. It is still challenging despite the recent rapid development of genomics and bioinformatics. A typical interpretation workflow includes annotation, filtration, manual inspection and literature review. Those steps are time-consuming and error-prone in the absence of systematic support. Therefore, we developed GTX.Digest.VCF, an online DNA sequencing interpretation system, which prioritizes genes and variants for novel disease-gene relation discovery and integrates text mining results to provide literature evidence for the discovery. Its phenotype-driven ranking and biological data mining approach significantly speed up the whole interpretation process. Results The GTX.Digest.VCF system is freely available as a web portal at http://vcf.gtxlab.com for academic research. Evaluation on the DDD project dataset demonstrates an accuracy of 77% (235 out of 305 cases) for top-50 genes and an accuracy of 41.6% (127 out of 305 cases) for top-5 genes. Conclusions GTX.Digest.VCF provides an intelligent web portal for genomics data interpretation via the integration of bioinformatics tools, distributed parallel computing, biomedical text mining. It can facilitate the application of genomic analytics in clinical research and practices.
Affiliation(s)
- Chengkun Wu
- State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha, 410073, China
- Yanghui Zhang
- NHC key laboratory of birth defects research, prevention and treatment (Hunan Provincial Maternal and Child Health Care Hospital), NO.53 Xiangchun Road, Changsha, 410008, Hunan, China
- Shaowei Zhang
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
- Shuojun Yu
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
- Peng Lei
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
- Qin Lu
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
- Yanwei Xi
- Cytogenetics and Human Molecular Genetics Laboratories, Royal University Hospital, Saskatoon, SK, Canada
- Hua Wang
- NHC key laboratory of birth defects research, prevention and treatment (Hunan Provincial Maternal and Child Health Care Hospital), NO.53 Xiangchun Road, Changsha, 410008, Hunan, China.,Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410073, China.
- Zhuo Song
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China.
9
Leveraging protein dynamics to identify cancer mutational hotspots using 3D structures. Proc Natl Acad Sci U S A 2019; 116:18962-18970. [PMID: 31462496 PMCID: PMC6754584 DOI: 10.1073/pnas.1901156116] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 12/12/2022]
Abstract
Large-scale exome sequencing of tumors has enabled the identification of cancer drivers using recurrence-based approaches. Some of these methods also employ 3D protein structures to identify mutational hotspots in cancer-associated genes. In determining such mutational clusters in structures, existing approaches overlook protein dynamics, despite its essential role in protein function. We present a framework to identify cancer driver genes using a dynamics-based search of mutational hotspot communities. Mutations are mapped to protein structures, which are partitioned into distinct residue communities. These communities are identified in a framework where residue-residue contact edges are weighted by correlated motions (as inferred by dynamics-based models). We then search for signals of positive selection among these residue communities to identify putative driver genes, while applying our method to the TCGA (The Cancer Genome Atlas) PanCancer Atlas missense mutation catalog. Overall, we predict 1 or more mutational hotspots within the resolved structures of proteins encoded by 434 genes. These genes were enriched among biological processes associated with tumor progression. Additionally, a comparison between our approach and existing cancer hotspot detection methods using structural data suggests that including protein dynamics significantly increases the sensitivity of driver detection.
10
Rao AR, Nelson SF. Calculating the statistical significance of rare variants causal for Mendelian and complex disorders. BMC Med Genomics 2018; 11:53. [PMID: 29898714 PMCID: PMC6001062 DOI: 10.1186/s12920-018-0371-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/16/2017] [Accepted: 05/25/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND With the expanding use of next-gen sequencing (NGS) to diagnose the thousands of rare Mendelian genetic diseases, it is critical to be able to interpret individual DNA variation. To calculate the significance of finding a rare protein-altering variant in a given gene, one must know the frequency of seeing a variant in the general population that is at least as damaging as the variant in question. METHODS We developed a general method to better interpret the likelihood that a rare variant is disease causing if observed in a given gene or genic region mapping to a described protein domain, using genome-wide information from a large control sample. Based on data from 2504 individuals in the 1000 Genomes Project dataset, we calculated the number of individuals who have a rare variant in a given gene for numerous filtering threshold scenarios, which may be used for calculating the significance of an observed rare variant being causal for disease. Additionally, we calculated mutational burden data on the number of individuals with rare variants in genic regions mapping to protein domains. RESULTS We describe methods to use the mutational burden data for calculating the significance of observing rare variants in a given proportion of sequenced individuals. We present SORVA, an implementation of these methods as a web tool, and we demonstrate application to 20 relevant but diverse next-gen sequencing studies. Specifically, we calculate the statistical significance of findings involving multi-family studies with rare Mendelian disease and a large-scale study of a complex disorder, autism spectrum disorder. If we use the frequency counts to rank genes based on intolerance for variation, the ranking correlates well with pLI scores derived from the Exome Aggregation Consortium (ExAC) dataset (ρ = 0.515), with the benefit that the scores are directly interpretable. 
CONCLUSIONS We have presented a strategy that is useful for vetting candidate genes from NGS studies and allows researchers to calculate the significance of seeing a variant in a given gene or protein domain. This approach is an important step towards developing a quantitative, statistics-based approach for presenting clinical findings.
Affiliation(s)
- Aliz R Rao
- Department of Human Genetics, University of California, Los Angeles, California, Los Angeles, USA.
- Stanley F Nelson
- Department of Human Genetics, University of California, Los Angeles, California, Los Angeles, USA.,Department of Psychiatry and Biobehavioral Sciences at the David Geffen School of Medicine, University of California, Los Angeles, California, Los Angeles, USA.,Department of Pathology and Laboratory Medicine, University of California, Los Angeles, California, Los Angeles, USA
11
Gosalia N, Economides AN, Dewey FE, Balasubramanian S. MAPPIN: a method for annotating, predicting pathogenicity and mode of inheritance for nonsynonymous variants. Nucleic Acids Res 2017; 45:10393-10402. [PMID: 28977528 PMCID: PMC5737764 DOI: 10.1093/nar/gkx730] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 03/02/2017] [Accepted: 08/21/2017] [Indexed: 01/24/2023]
Abstract
Nonsynonymous single nucleotide variants (nsSNVs) constitute about 50% of known disease-causing mutations and understanding their functional impact is an area of active research. Existing algorithms predict pathogenicity of nsSNVs; however, they are unable to differentiate heterozygous, dominant disease-causing variants from heterozygous carrier variants that lead to disease only in the homozygous state. Here, we present MAPPIN (Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants), a prediction method which utilizes a random forest algorithm to distinguish between nsSNVs with dominant, recessive, and benign effects. We apply MAPPIN to a set of Mendelian disease-causing mutations and accurately predict pathogenicity for all mutations. Furthermore, MAPPIN predicts mode of inheritance correctly for 70.3% of nsSNVs. MAPPIN also correctly predicts pathogenicity for 87.3% of mutations from the Deciphering Developmental Disorders Study with a 78.5% accuracy for mode of inheritance. When tested on a larger collection of mutations from the Human Gene Mutation Database, MAPPIN is able to significantly discriminate between mutations in known dominant and recessive genes. Finally, we demonstrate that MAPPIN outperforms CADD and Eigen in predicting disease inheritance modes for all validation datasets. To our knowledge, MAPPIN is the first nsSNV pathogenicity prediction algorithm that provides mode of inheritance predictions, adding another layer of information for variant prioritization.
Affiliation(s)
- Nehal Gosalia
- Regeneron Genetics Center, Tarrytown, NY 10591, USA.,Regeneron Pharmaceuticals, Tarrytown, NY 10591, USA
- Aris N Economides
- Regeneron Genetics Center, Tarrytown, NY 10591, USA.,Regeneron Pharmaceuticals, Tarrytown, NY 10591, USA
12
Balasubramanian S, Fu Y, Pawashe M, McGillivray P, Jin M, Liu J, Karczewski KJ, MacArthur DG, Gerstein M. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes. Nat Commun 2017; 8:382. [PMID: 28851873 PMCID: PMC5575292 DOI: 10.1038/s41467-017-00443-5] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 05/20/2016] [Accepted: 06/29/2017] [Indexed: 11/09/2022]
Abstract
Variants predicted to result in the loss of function of human genes have attracted interest because of their clinical impact and surprising prevalence in healthy individuals. Here, we present ALoFT (annotation of loss-of-function transcripts), a method to annotate and predict the disease-causing potential of loss-of-function variants. Using data from Mendelian disease-gene discovery projects, we show that ALoFT can distinguish between loss-of-function variants that are deleterious as heterozygotes and those causing disease only in the homozygous state. Investigation of variants discovered in healthy populations suggests that each individual carries at least two heterozygous premature stop alleles that could potentially lead to disease if present as homozygotes. When applied to de novo putative loss-of-function variants in autism-affected families, ALoFT distinguishes between deleterious variants in patients and benign variants in unaffected siblings. Finally, analysis of somatic variants in >6500 cancer exomes shows that putative loss-of-function variants predicted to be deleterious by ALoFT are enriched in known driver genes. Variants causing loss of function (LoF) of human genes have clinical implications. Here, the authors present a method to predict disease-causing potential of LoF variants, ALoFT (annotation of Loss-of-Function Transcripts) and show its application to interpreting LoF variants in different contexts.
Affiliation(s)
- Suganthi Balasubramanian
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA.
- Regeneron Genetics Center, Tarrytown, NY, 10591, USA.
| | - Yao Fu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
- Bina Technologies, Part of Roche Sequencing, Belmont, CA, 94002, USA
| | - Mayur Pawashe
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA
| | - Patrick McGillivray
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA
| | - Mike Jin
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA
| | - Jeremy Liu
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA
| | - Konrad J Karczewski
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, 02142, USA
| | - Daniel G MacArthur
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, 02142, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.
- Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT, 06520, USA.
- Department of Computer Science, Yale University, New Haven, CT, 06520, USA.
| |
|
13
|
Dhingra P, Fu Y, Gerstein M, Khurana E. Using FunSeq2 for Coding and Non-Coding Variant Annotation and Prioritization. Curr Protoc Bioinformatics 2017; 57:15.11.1-15.11.17. [DOI: 10.1002/cpbi.23] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Affiliation(s)
- Priyanka Dhingra
- Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York 10021
| | - Yao Fu
- Bina Technologies, Roche Sequencing, Redwood City, California
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut
- Department of Computer Science, Yale University, New Haven, Connecticut
| | - Ekta Khurana
- Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York 10021
- Meyer Cancer Center, Weill Cornell Medical College, New York, New York
- Englander Institute for Precision Medicine, Weill Cornell Medical College, New York, New York
| |
|
14
|
Lee CR, Svardal H, Farlow A, Exposito-Alonso M, Ding W, Novikova P, Alonso-Blanco C, Weigel D, Nordborg M. On the post-glacial spread of human commensal Arabidopsis thaliana. Nat Commun 2017; 8:14458. [PMID: 28181519 PMCID: PMC5309843 DOI: 10.1038/ncomms14458] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2016] [Accepted: 01/03/2017] [Indexed: 02/03/2023] Open
Abstract
Recent work has shown that Arabidopsis thaliana contains genetic groups originating from different ice age refugia, with one particular group comprising over 95% of the current worldwide population. In Europe, relicts of other groups can be found in local populations along the Mediterranean Sea. Here we provide evidence that these 'relicts' occupied post-glacial Eurasia first and were later replaced by the invading 'non-relicts', which expanded through the east-west axis of Eurasia, leaving traces of admixture in the north and south of the species range. The non-relict expansion was likely associated with human activity and led to a demographic replacement similar to what occurred in humans. Introgressed genomic regions from relicts are associated with flowering time and enriched for genes associated with environmental conditions, such as root cap development or metal ion trans-membrane transport, which suggest that admixture with locally adapted relicts helped the non-relicts colonize new habitats.
Affiliation(s)
- Cheng-Ruei Lee
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr Bohr-Gasse 3, 1030 Vienna, Austria
- Institute of Ecology and Evolutionary Biology & Institute of Plant Biology, National Taiwan University, No. 1, Section 4, Roosevelt Rd, 10617 Taipei, Taiwan
| | - Hannes Svardal
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr Bohr-Gasse 3, 1030 Vienna, Austria
| | - Ashley Farlow
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr Bohr-Gasse 3, 1030 Vienna, Austria
| | - Moises Exposito-Alonso
- Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany
| | - Wei Ding
- Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany
| | - Polina Novikova
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr Bohr-Gasse 3, 1030 Vienna, Austria
| | - Carlos Alonso-Blanco
- Departamento de Genética Molecular de Plantas, Centro Nacional de Biotecnología (CNB), Consejo Superior de Investigaciones Científicas (CSIC), Madrid 28049, Spain
| | - Detlef Weigel
- Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany
| | - Magnus Nordborg
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr Bohr-Gasse 3, 1030 Vienna, Austria
| |
|
15
|
A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data. Int J Genomics 2016; 2016:7983236. [PMID: 28070503 PMCID: PMC5192301 DOI: 10.1155/2016/7983236] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2016] [Accepted: 10/26/2016] [Indexed: 12/31/2022] Open
Abstract
Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shifted the bottleneck in experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. These tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make more informative decisions on selecting the best tools to analyze their WES data.
|
16
|
Kumar S, Clarke D, Gerstein M. Localized structural frustration for evaluating the impact of sequence variants. Nucleic Acids Res 2016; 44:10062-10073. [PMID: 27915290 PMCID: PMC5137452 DOI: 10.1093/nar/gkw927] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2016] [Revised: 09/30/2016] [Accepted: 10/14/2016] [Indexed: 12/13/2022] Open
Abstract
Population-scale sequencing is increasingly uncovering large numbers of rare single-nucleotide variants (SNVs) in coding regions of the genome. The rarity of these variants makes it challenging to evaluate their deleteriousness with conventional phenotype-genotype associations. Protein structures provide a way of addressing this challenge. Previous efforts have focused on globally quantifying the impact of SNVs on protein stability. However, local perturbations may severely impact protein functionality without strongly disrupting global stability (e.g. in relation to catalysis or allostery). Here, we describe a workflow in which localized frustration, quantifying unfavorable local interactions, is employed as a metric to investigate such effects. Using this workflow on the Protein Databank, we find that frustration produces many immediately intuitive results: for instance, disease-related SNVs create stronger changes in localized frustration than non-disease related variants, and rare SNVs tend to disrupt local interactions to a larger extent than common variants. Less obviously, we observe that somatic SNVs associated with oncogenes and tumor suppressor genes (TSGs) induce very different changes in frustration. In particular, those associated with TSGs change the frustration more in the core than the surface (by introducing loss-of-function events), whereas those associated with oncogenes manifest the opposite pattern, creating gain-of-function events.
Affiliation(s)
- Sushant Kumar
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
| | - Declan Clarke
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
- Department of Chemistry, Yale University, 225 Prospect Street, New Haven, CT 06520, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, 260/266 Whitney Avenue PO Box 208114, New Haven, CT 06520, USA
| |
|
17
|
eMERGE Phenome-Wide Association Study (PheWAS) identifies clinical associations and pleiotropy for stop-gain variants. BMC Med Genomics 2016; 9 Suppl 1:32. [PMID: 27535653 PMCID: PMC4989894 DOI: 10.1186/s12920-016-0191-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND We explored premature stop-gain variants to test the hypothesis that variants, which are likely to have a consequence on protein structure and function, will reveal important insights with respect to the phenotypes associated with them. We performed a phenome-wide association study (PheWAS) exploring the association between a selected list of functional stop-gain genetic variants (variation resulting in truncated proteins or in nonsense-mediated decay) and an extensive group of diagnoses to identify novel associations and uncover potential pleiotropy. RESULTS In this study, we selected 25 stop-gain variants: 5 stop-gain variants with previously reported phenotypic associations, and a set of 20 putative stop-gain variants identified using dbSNP. For the PheWAS, we used data from the electronic MEdical Records and GEnomics (eMERGE) Network across 9 sites with a total of 41,057 unrelated patients. We divided all these samples into two datasets by equal proportion of eMERGE site, sex, race, and genotyping platform. We calculated single effect associations between these 25 stop-gain variants and ICD-9 defined case-control diagnoses. We also performed stratified analyses for samples of European and African ancestry. Associations were adjusted for sex, site, genotyping platform and the first three principal components to account for global ancestry. We identified previously known associations, such as variants in LPL associated with hyperglyceridemia, indicating that our approach was robust. We also found a total of three significant associations with p < 0.01 in both datasets, with the most significant replicating result being LPL SNP rs328 and ICD-9 code 272.1 "Disorder of Lipoid metabolism" (p_discovery = 2.59 × 10⁻⁶, p_replicating = 2.7 × 10⁻⁴). The other two significant replicated associations identified by this study are: variant rs1137617 in KCNH2 gene associated with ICD-9 code category 244 "Acquired Hypothyroidism" (p_discovery = 5.31 × 10⁻³, p_replicating = 1.15 × 10⁻³) and variant rs12060879 in DPT gene associated with ICD-9 code category 996 "Complications peculiar to certain specified procedures" (p_discovery = 8.65 × 10⁻³, p_replicating = 4.16 × 10⁻³). CONCLUSION In conclusion, this PheWAS revealed novel associations of stop-gained variants with interesting phenotypes (ICD-9 codes) along with pleiotropic effects.
|
18
|
Lelieveld SH, Veltman JA, Gilissen C. Novel bioinformatic developments for exome sequencing. Hum Genet 2016; 135:603-14. [PMID: 27075447 PMCID: PMC4883269 DOI: 10.1007/s00439-016-1658-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2016] [Accepted: 03/15/2016] [Indexed: 01/19/2023]
Abstract
With the widespread adoption of next generation sequencing technologies by the genetics community and the rapid decrease in costs per base, exome sequencing has become a standard within the repertoire of genetic experiments for both research and diagnostics. Although bioinformatics now offers standard solutions for the analysis of exome sequencing data, many challenges still remain; especially the increasing scale at which exome data are now being generated has given rise to novel challenges in how to efficiently store, analyze and interpret exome data of this magnitude. In this review we discuss some of the recent developments in bioinformatics for exome sequencing and the directions that this is taking us to. With these developments, exome sequencing is paving the way for the next big challenge, the application of whole genome sequencing.
Affiliation(s)
- Stefan H Lelieveld
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
| | - Joris A Veltman
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
- Department of Clinical Genetics, GROW-School for Oncology and Developmental Biology, Maastricht University Medical Centre, Universiteitssingel 50, 6229 ER, Maastricht, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands.
| |
|
19
|
Lelieveld SH, Veltman JA, Gilissen C. Novel bioinformatic developments for exome sequencing. Hum Genet 2016. [PMID: 27075447 DOI: 10.1007/s00439-016-1658-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
With the widespread adoption of next generation sequencing technologies by the genetics community and the rapid decrease in costs per base, exome sequencing has become a standard within the repertoire of genetic experiments for both research and diagnostics. Although bioinformatics now offers standard solutions for the analysis of exome sequencing data, many challenges still remain; especially the increasing scale at which exome data are now being generated has given rise to novel challenges in how to efficiently store, analyze and interpret exome data of this magnitude. In this review we discuss some of the recent developments in bioinformatics for exome sequencing and the directions that this is taking us to. With these developments, exome sequencing is paving the way for the next big challenge, the application of whole genome sequencing.
Affiliation(s)
- Stefan H Lelieveld
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
| | - Joris A Veltman
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
- Department of Clinical Genetics, GROW-School for Oncology and Developmental Biology, Maastricht University Medical Centre, Universiteitssingel 50, 6229 ER, Maastricht, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands.
| |
|
20
|
Abstract
High-throughput platforms such as microarray, mass spectrometry, and next-generation sequencing are producing an increasing volume of omics data that needs large data storage and computing power. Cloud computing offers massive scalable computing and storage, data sharing, and on-demand access to resources and applications anytime and anywhere, and thus it may represent the key technology for facing those issues. In fact, in recent years it has been adopted for the deployment of different bioinformatics solutions and services both in academia and in industry. Despite this, cloud computing presents several issues regarding the security and privacy of data, which are particularly important when analyzing patient data, such as in personalized medicine. This chapter reviews the main academic and industrial cloud-based bioinformatics solutions, with a special focus on microarray data analysis solutions, and underlines the main issues and problems related to the use of such platforms for the storage and analysis of patient data.
Affiliation(s)
- Barbara Calabrese
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
| | - Mario Cannataro
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy.
| |
|
21
|
Fu Y, Liu Z, Lou S, Bedford J, Mu XJ, Yip KY, Khurana E, Gerstein M. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 2015; 15:480. [PMID: 25273974 PMCID: PMC4203974 DOI: 10.1186/s13059-014-0480-5] [Citation(s) in RCA: 226] [Impact Index Per Article: 25.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2014] [Indexed: 12/15/2022] Open
Abstract
Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.
Affiliation(s)
- Yao Fu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
| |
|
22
|
Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc 2015; 10:1556-66. [PMID: 26379229 DOI: 10.1038/nprot.2015.105] [Citation(s) in RCA: 599] [Impact Index Per Article: 66.6] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
Recent developments in sequencing techniques have enabled rapid and high-throughput generation of sequence data, democratizing the ability to compile information on large amounts of genetic variations in individual laboratories. However, there is a growing gap between the generation of raw sequencing data and the extraction of meaningful biological information. Here, we describe a protocol to use the ANNOVAR (ANNOtate VARiation) software to facilitate fast and easy variant annotations, including gene-based, region-based and filter-based annotations on a variant call format (VCF) file generated from human genomes. We further describe a protocol for gene-based annotation of a newly sequenced nonhuman species. Finally, we describe how to use a user-friendly and easily accessible web server called wANNOVAR to prioritize candidate genes for a Mendelian disease. The variant annotation protocols take 5-30 min of computer time, depending on the size of the variant file, and 5-10 min of hands-on time. In summary, through the command-line tool and the web server, these protocols provide a convenient means to analyze genetic variants generated in humans and other species.
Affiliation(s)
- Hui Yang
- Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA
- Neuroscience Graduate Program, University of Southern California, Los Angeles, California, USA
| | - Kai Wang
- Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA
- Department of Psychiatry, University of Southern California, Los Angeles, California, USA
- Department of Preventive Medicine, Division of Bioinformatics, University of Southern California, Los Angeles, California, USA
| |
|
23
|
Yang H, Robinson PN, Wang K. Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nat Methods 2015; 12:841-3. [PMID: 26192085 PMCID: PMC4718403 DOI: 10.1038/nmeth.3484] [Citation(s) in RCA: 263] [Impact Index Per Article: 29.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2015] [Accepted: 05/18/2015] [Indexed: 12/21/2022]
Abstract
Prior biological knowledge and phenotype information may help to identify disease genes from human whole-genome and whole-exome sequencing studies. We developed Phenolyzer (http://phenolyzer.usc.edu), a tool that uses prior information to implicate genes involved in diseases. Phenolyzer exhibits superior performance over competing methods for prioritizing Mendelian and complex disease genes, based on disease or phenotype terms entered as free text.
Affiliation(s)
- Hui Yang
- Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA
- Neuroscience Graduate Program, University of Southern California, Los Angeles, California, USA
| | - Peter N Robinson
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
- Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Berlin, Germany
- Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Kai Wang
- Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA
- Department of Psychiatry, University of Southern California, Los Angeles, California, USA
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California, USA
| |
|
24
|
Frankish A, Uszczynska B, Ritchie GRS, Gonzalez JM, Pervouchine D, Petryszak R, Mudge JM, Fonseca N, Brazma A, Guigo R, Harrow J. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics 2015; 16 Suppl 8:S2. [PMID: 26110515 PMCID: PMC4502323 DOI: 10.1186/1471-2164-16-s8-s2] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023] Open
Abstract
Background A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions The reference transcripts selected for variant functional annotation do have a large effect on the outcome. 
The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
|
25
|
Podicheti R, Mockaitis K. FEATnotator: A tool for integrated annotation of sequence features and variation, facilitating interpretation in genomics experiments. Methods 2015; 79-80:11-7. [PMID: 25934264 DOI: 10.1016/j.ymeth.2015.04.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2014] [Revised: 03/25/2015] [Accepted: 04/22/2015] [Indexed: 11/16/2022] Open
Abstract
As approaches are sought for more efficient and democratized uses of non-model and expanded model genomics references, ease of integration of genomic feature datasets is especially desirable in multidisciplinary research communities. Valuable conclusions are often missed or slowed when researchers refer experimental results to a single reference sequence that lacks integrated pan-genomic and multi-experiment data in accessible formats. Association of genomic positional information, such as results from an expansive variety of next-generation sequencing experiments, with annotated reference features such as genes or predicted protein binding sites, provides the context essential for conclusions and ongoing research. When the experimental system includes polymorphic genomic inputs, rapid calculation of gene structural and protein translational effects of sequence variation from the reference can be invaluable. Here we present FEATnotator, a lightweight, fast and easy to use open source software program that integrates and reports overlap and proximity in genomic information from any user-defined datasets including those from next generation sequencing applications. We illustrate use of the tool by summarizing whole genome sequence variation of a widely used natural isolate of Arabidopsis thaliana in the context of gene models of the reference accession. Previous discovery of a protein coding deletion influencing root development is replicated rapidly. Appropriate even in investigations of a single gene or genic regions such as QTL, comprehensive reports provided by FEATnotator better prepare researchers for interpretation of their experimental results. The tool is available for download at http://featnotator.sourceforge.net.
Affiliation(s)
- Ram Podicheti
- Center for Genomics and Bioinformatics, Indiana University, 1001 E. Third Street, Bloomington, IN 47405, USA; School of Informatics and Computing, Indiana University, 919 E. Tenth Street, Bloomington, IN 47408, USA.
| | - Keithanne Mockaitis
- Pervasive Technology Institute, Indiana University, 2709 E. Tenth Street, Bloomington, IN 47408, USA; Department of Biology, Indiana University, 915 E. Third Street, Bloomington, IN 47405, USA.
| |
|
26
|
Li MJ, Deng J, Wang P, Yang W, Ho SL, Sham PC, Wang J, Li M. wKGGSeq: A Comprehensive Strategy-Based and Disease-Targeted Online Framework to Facilitate Exome Sequencing Studies of Inherited Disorders. Hum Mutat 2015; 36:496-503. [PMID: 25676918 DOI: 10.1002/humu.22766] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2014] [Accepted: 02/03/2015] [Indexed: 12/19/2022]
Abstract
With the rapid advances in high-throughput sequencing technologies, exome sequencing and targeted region sequencing have become routine approaches for identifying mutations of inherited disorders in both genetics research and molecular diagnosis. There is an imminent need for comprehensive and easy-to-use downstream analysis tools to isolate causal mutations in exome sequencing studies. We have developed a user-friendly online framework, wKGGSeq, to provide systematic annotation, filtration, prioritization, and visualization functions for characterizing causal mutation(s) in exome sequencing studies of inherited disorders. wKGGSeq provides: (1) a novel strategy-based procedure for downstream analysis of a large amount of exome sequencing data and (2) a disease-targeted analysis procedure to facilitate clinical diagnosis of well-studied genetic diseases. In addition, it is also equipped with abundant online annotation functions for sequence variants. We demonstrate that wKGGSeq either outperforms or is comparable to two popular tools in several real exome sequencing samples. This tool will greatly facilitate the downstream analysis of exome sequencing data and can play a useful role for researchers and clinicians in identifying causal mutations of inherited disorders. The wKGGSeq is freely available at http://statgenpro.psychiatry.hku.hk/wkggseq or http://jjwanglab.org/wkggseq, and will be updated frequently.
Affiliation(s)
- Mulin Jun Li
- Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, SAR, China; Departments of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, SAR, China; Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, Guangdong, 518057, China
| |
|
27
|
|
28
|
Scheuch M, Höper D, Beer M. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets. BMC Bioinformatics 2015; 16:69. [PMID: 25886935 PMCID: PMC4351923 DOI: 10.1186/s12859-015-0503-6] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2014] [Accepted: 02/20/2015] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Fuelled by the advent and subsequent development of next generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. RESULTS To overcome the bottleneck of data analysis, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. CONCLUSIONS RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011 when it was used to detect the orthobunyavirus sequences leading to the discovery of Schmallenberg virus.
Affiliation(s)
- Matthias Scheuch
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald - Insel Riems, Germany.
- Dirk Höper
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald - Insel Riems, Germany.
- Martin Beer
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald - Insel Riems, Germany.
|
29
|
Abstract
Identifying sequence variants that play a mechanistic role in human disease and other phenotypes is a fundamental goal in human genetics and will be important in translating the results of variation studies. Experimental validation to confirm that a variant causes the biochemical changes responsible for a given disease or phenotype is considered the gold standard, but this cannot currently be applied to the 3 million or so variants expected in an individual genome. This has prompted the development of a wide variety of computational approaches that use several different sources of information to identify functional variation. Here, we review and assess the limitations of computational techniques for categorizing variants according to functional classes, prioritizing variants for experimental follow-up and generating hypotheses about the possible molecular mechanisms to inform downstream experiments. We discuss the main current bioinformatics approaches to identifying functional variation, including widely used algorithms for coding variation such as SIFT and PolyPhen and also novel techniques for interpreting variation across the genome.
Affiliation(s)
- Graham RS Ritchie
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD UK
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA UK
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD UK
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA UK
|
30
|
Li MJ, Wang J. Current trend of annotating single nucleotide variation in humans--A case study on SNVrap. Methods 2014; 79-80:32-40. [PMID: 25308971 DOI: 10.1016/j.ymeth.2014.10.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/21/2014] [Revised: 09/25/2014] [Accepted: 10/02/2014] [Indexed: 12/16/2022] Open
Abstract
As high-throughput methods, such as whole genome genotyping arrays, whole exome sequencing (WES) and whole genome sequencing (WGS), have detected huge numbers of genetic variants associated with human diseases, functional annotation of these variants is an indispensable step in understanding disease etiology. Large-scale functional genomics projects, such as The ENCODE Project and Roadmap Epigenomics Project, provide genome-wide profiling of functional elements across different human cell types and tissues. Given the urgent need to identify disease-causal variants, a comprehensive and easy-to-use annotation tool is in high demand. Here we review and discuss current progress and trends in the variant annotation field. Furthermore, we introduce a comprehensive web portal for annotating human genetic variants. We use gene-based features and the latest functional genomics datasets to annotate single nucleotide variations (SNVs) in humans at whole-genome scale. We further apply several function prediction algorithms to annotate SNVs that might affect different biological processes, including transcriptional gene regulation, alternative splicing, post-transcriptional regulation, translation and post-translational modifications. The SNVrap web portal is freely available at http://jjwanglab.org/snvrap.
Affiliation(s)
- Mulin Jun Li
- Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Junwen Wang
- Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China.
|
31
|
Ho ED, Cao Q, Lee SD, Yip KY. VAS: a convenient web portal for efficient integration of genomic features with millions of genetic variants. BMC Genomics 2014; 15:886. [PMID: 25306238 PMCID: PMC4210471 DOI: 10.1186/1471-2164-15-886] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/15/2014] [Accepted: 10/03/2014] [Indexed: 12/29/2022] Open
Abstract
Background High-throughput experimental methods have fostered the systematic detection of millions of genetic variants from any human genome. To help explore the potential biological implications of these genetic variants, software tools have been previously developed for integrating various types of information about these genomic regions from multiple data sources. Most of these tools were designed either for studying a small number of variants at a time, or for local execution on powerful machines. Results To make exploration of whole lists of genetic variants simple and accessible, we have developed a new Web-based system called VAS (Variant Annotation System, available at https://yiplab.cse.cuhk.edu.hk/vas/). It provides a large variety of information useful for studying both coding and non-coding variants, including whole-genome transcription factor binding, open chromatin and transcription data from the ENCODE consortium. By means of data compression, millions of variants can be uploaded from a client machine to the server in less than 50 megabytes of data. On the server side, our customized data integration algorithms can efficiently link millions of variants with tens of whole-genome datasets. These two enabling technologies make VAS a practical tool for annotating genetic variants from large genomic studies. We demonstrate the use of VAS in annotating genetic variants obtained from a migraine meta-analysis study and multiple data sets from the Personal Genomes Project. We also compare the running time of annotating 6.4 million SNPs of the CEU trio by VAS and another tool, showing that VAS is efficient in handling new variant lists without requiring any pre-computations. Conclusions VAS is specially designed to handle annotation tasks with long lists of genetic variants and large numbers of annotating features efficiently. It is complementary to other existing tools with more specific aims such as evaluating the potential impacts of genetic variants in terms of disease risk. We recommend using VAS for a quick first-pass identification of potentially interesting genetic variants, to minimize the time required for other more in-depth downstream analyses.
Affiliation(s)
- Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
|
32
|
Shanahan HP, Owen AM, Harrison AP. Bioinformatics on the cloud computing platform Azure. PLoS One 2014; 9:e102642. [PMID: 25050811 PMCID: PMC4106841 DOI: 10.1371/journal.pone.0102642] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 04/23/2013] [Accepted: 06/20/2014] [Indexed: 12/27/2022] Open
Abstract
We discuss the applicability of the Microsoft cloud computing platform, Azure, for bioinformatics. We focus on the usability of the resource rather than its performance. We provide an example of how R can be used on Azure to analyse a large amount of microarray expression data deposited at the public database ArrayExpress. We provide a walk through to demonstrate explicitly how Azure can be used to perform these analyses in Appendix S1 and we offer a comparison with a local computation. We note that the use of the Platform as a Service (PaaS) offering of Azure can represent a steep learning curve for bioinformatics developers who will usually have a Linux and scripting language background. On the other hand, the presence of an additional set of libraries makes it easier to deploy software in a parallel (scalable) fashion and explicitly manage such a production run with only a few hundred lines of code, most of which can be incorporated from a template. We propose that this environment is best suited for running stable bioinformatics software by users not involved with its development.
Affiliation(s)
- Hugh P. Shanahan
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
- Anne M. Owen
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
- Andrew P. Harrison
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
- Department of Biological Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
|
33
|
Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 2014; 46:818-25. [PMID: 24974849 DOI: 10.1038/ng.3021] [Citation(s) in RCA: 486] [Impact Index Per Article: 48.6] [Received: 10/10/2013] [Accepted: 06/06/2014] [Indexed: 12/16/2022]
Abstract
Whole-genome sequencing enables complete characterization of genetic variation, but geographic clustering of rare alleles demands many diverse populations be studied. Here we describe the Genome of the Netherlands (GoNL) Project, in which we sequenced the whole genomes of 250 Dutch parent-offspring families and constructed a haplotype map of 20.4 million single-nucleotide variants and 1.2 million insertions and deletions. The intermediate coverage (∼13×) and trio design enabled extensive characterization of structural variation, including midsize events (30-500 bp) previously poorly catalogued and de novo mutations. We demonstrate that the quality of the haplotypes boosts imputation accuracy in independent samples, especially for lower frequency alleles. Population genetic analyses demonstrate fine-scale structure across the country and support multiple ancient migrations, consistent with historical changes in sea level and flooding. The GoNL Project illustrates how single-population whole-genome sequencing can provide detailed characterization of genetic variation and may guide the design of future population studies.
|
34
|
Dall'Olio GM, Bertranpetit J, Wagner A, Laayouni H. Human genome variation and the concept of genotype networks. PLoS One 2014; 9:e99424. [PMID: 24911413 PMCID: PMC4049842 DOI: 10.1371/journal.pone.0099424] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 02/11/2014] [Accepted: 05/14/2014] [Indexed: 12/29/2022] Open
Abstract
Genotype networks are a concept used in systems biology to study sets of genotypes having the same phenotype, and the ability of these to bring forth novel phenotypes. In the past they have been applied to determine the genetic heterogeneity, and stability to mutations, of systems such as metabolic networks and RNA folds. Recently, they have been the base for reconciling the neutralist and selectionist views on evolution. Here, we adapted this concept to the study of population genetics data. Specifically, we applied genotype networks to the human 1000 genomes dataset, and analyzed networks composed of short haplotypes of Single Nucleotide Variants (SNV). The result is a scan of how properties related to genetic heterogeneity and stability to mutations are distributed along the human genome. We found that genes involved in acquired immunity, such as some HLA and MHC genes, tend to have the most heterogeneous and connected networks, and that coding regions tend to be more heterogeneous and stable to mutations than non-coding regions. We also found, using coalescent simulations, that regions under selection have more extended and connected networks. The application of the concept of genotype networks can provide a new opportunity to understand the evolutionary processes that shaped our genome. Learning how the genotype space of each region of our genome has been explored during the evolutionary history of the human species can lead to a better understanding on how selective pressures and neutral factors have shaped genetic diversity within populations and among individuals. Combined with the availability of larger datasets of sequencing data, genotype networks represent a new approach to the study of human genetic diversity that looks to the whole genome, and goes beyond the classical division between selection and neutrality methods.
Affiliation(s)
- Jaume Bertranpetit
- Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
- Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland
- The Swiss Institute of Bioinformatics, Lausanne, Switzerland
- The Santa Fe Institute, Santa Fe, New Mexico, United States of America
- Hafid Laayouni
- Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
- Universitat Autonòma de Barcelona, Barcelona, Spain
|
35
|
Jäger M, Wang K, Bauer S, Smedley D, Krawitz P, Robinson PN. Jannovar: a java library for exome annotation. Hum Mutat 2014; 35:548-55. [PMID: 24677618 DOI: 10.1002/humu.22531] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Received: 08/16/2013] [Accepted: 02/11/2014] [Indexed: 01/03/2023]
Abstract
Transcript-based annotation and pedigree analysis are two basic steps in the computational analysis of whole-exome sequencing experiments in genetic diagnostics and disease-gene discovery projects. Here, we present Jannovar, a stand-alone Java application as well as a Java library designed to be used in larger software frameworks for exome and genome analysis. Jannovar uses an interval tree to identify all transcripts affected by a given variant, and provides Human Genome Variation Society-compliant annotations both for variants affecting coding sequences and splice junctions as well as untranslated regions and noncoding RNA transcripts. Jannovar can also perform family-based pedigree analysis with Variant Call Format (VCF) files with data from members of a family segregating a Mendelian disorder. Using a desktop computer, Jannovar requires a few seconds to annotate a typical VCF file with exome data. Jannovar is freely available under the BSD2 license. Source code as well as the Java application and library file can be downloaded from http://compbio.charite.de (with tutorial) and https://github.com/charite/jannovar.
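The interval-tree lookup mentioned in this abstract can be illustrated with a minimal Python sketch. It is not Jannovar's Java implementation; the class name `IntervalNode`, the helper `transcripts_at`, the closed-coordinate convention, and the transcript identifiers are all hypothetical and chosen only to show how a centered interval tree returns every transcript overlapping a variant position.

```python
from typing import List, Optional, Tuple

# (start, end, transcript_id), closed coordinates; hypothetical layout for illustration
Interval = Tuple[int, int, str]


class IntervalNode:
    """Centered interval tree: each node stores the intervals that span its center."""

    def __init__(self, intervals: List[Interval]) -> None:
        # Use the median endpoint as the center, so at least one interval spans it
        # and both subtrees are strictly smaller than the input.
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        spanning = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(spanning, key=lambda iv: iv[0])
        self.by_end = sorted(spanning, key=lambda iv: iv[1], reverse=True)
        self.left: Optional["IntervalNode"] = IntervalNode(left) if left else None
        self.right: Optional["IntervalNode"] = IntervalNode(right) if right else None

    def query(self, pos: int) -> List[Interval]:
        """Return all stored intervals that contain position `pos`."""
        hits: List[Interval] = []
        if pos < self.center:
            # Spanning intervals end at or after the center, so only starts matter.
            for iv in self.by_start:
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left is not None:
                hits.extend(self.left.query(pos))
        elif pos > self.center:
            # Spanning intervals start at or before the center, so only ends matter.
            for iv in self.by_end:
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right is not None:
                hits.extend(self.right.query(pos))
        else:
            hits.extend(self.by_start)
        return hits


def transcripts_at(tree: IntervalNode, pos: int) -> List[str]:
    """Sorted identifiers of all transcripts overlapping a variant position."""
    return sorted(tid for _, _, tid in tree.query(pos))


# Made-up transcripts on a single chromosome.
tree = IntervalNode([(100, 500, "TX1"), (300, 800, "TX2"), (900, 1200, "TX3")])
print(transcripts_at(tree, 400))
```

A query visits one root-to-leaf path and scans only the matching prefix of each node's sorted list, which is why this structure suits annotating many variants against a fixed transcript set.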
Affiliation(s)
- Marten Jäger
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany; Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Berlin, Germany
|
36
|
McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P. Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014; 6:26. [PMID: 24944579 PMCID: PMC4062061 DOI: 10.1186/gm543] [Citation(s) in RCA: 131] [Impact Index Per Article: 13.1] [Received: 10/07/2013] [Accepted: 03/20/2014] [Indexed: 12/19/2022] Open
Abstract
Background Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. Methods This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts. Results We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. Conclusions Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
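The agreement percentages reported in this abstract are rates of matching annotations between two annotation runs. The Python sketch below shows one simple way such a rate could be computed. It assumes each run is a mapping from a variant identifier to a single consequence label, and that a variant counts towards a category if either run places it there. The function name `agreement_rate`, the labels, and the toy data are hypothetical, and the paper's actual matching rules may differ.

```python
import math
from typing import Dict, Optional, Set


def agreement_rate(
    run_a: Dict[str, str],
    run_b: Dict[str, str],
    categories: Optional[Set[str]] = None,
) -> float:
    """Fraction of shared variants given the same consequence label by both runs.

    If `categories` is given, only variants that either run assigns to one of
    those categories are considered. Returns NaN when nothing is comparable.
    """
    shared = run_a.keys() & run_b.keys()
    if categories is not None:
        shared = {v for v in shared if run_a[v] in categories or run_b[v] in categories}
    if not shared:
        return math.nan
    matches = sum(1 for v in shared if run_a[v] == run_b[v])
    return matches / len(shared)


# Toy data: the same four variants annotated against two transcript sets.
refseq_run = {"v1": "stop_gained", "v2": "missense", "v3": "synonymous", "v4": "frameshift"}
ensembl_run = {"v1": "stop_gained", "v2": "missense", "v3": "missense", "v4": "splicing"}
LOF = {"stop_gained", "frameshift", "splicing"}  # illustrative loss-of-function labels

overall = agreement_rate(refseq_run, ensembl_run)
lof_only = agreement_rate(refseq_run, ensembl_run, LOF)
print(overall, lof_only)
```

Restricting the denominator to a category, as in `lof_only`, is what lets per-category agreement differ sharply from the overall figure.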
Affiliation(s)
- Davis J McCarthy
- Department of Statistics, University of Oxford, South Parks Road, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
- Peter Humburg
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
- Alexander Kanapin
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
- Manuel A Rivas
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
- Kyle Gaulton
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
- Peter Donnelly
- Department of Statistics, University of Oxford, South Parks Road, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
|
37
|
Lee IH, Lee K, Hsing M, Choe Y, Park JH, Kim SH, Bohn JM, Neu MB, Hwang KB, Green RC, Kohane IS, Kong SW. Prioritizing disease-linked variants, genes, and pathways with an interactive whole-genome analysis pipeline. Hum Mutat 2014; 35:537-47. [PMID: 24478219 DOI: 10.1002/humu.22520] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/04/2013] [Accepted: 01/23/2014] [Indexed: 01/02/2023]
Abstract
Whole-genome sequencing (WGS) studies are uncovering disease-associated variants in both rare and nonrare diseases. Using next-generation sequencing for WGS requires a series of computational methods for alignment, variant detection, and annotation, and the accuracy and reproducibility of annotation results are essential for clinical implementation. However, annotating WGS data with up-to-date genomic information is still challenging for biomedical researchers. Here, we present one of the fastest and most scalable annotation, filtering, and analysis pipelines, gNOME, to prioritize phenotype-associated variants while minimizing false-positive findings. The intuitive graphical user interface of gNOME facilitates the selection of phenotype-associated variants, and the result summaries are provided at variant, gene, and genome levels. Moreover, the enrichment results of specific variants, genes, and gene sets between two groups, or compared with population-scale WGS datasets that are already integrated in the pipeline, can help the interpretation. We found a small number of discordant results between annotation software tools, in part due to different reporting strategies for variants with complex impacts. Using two published whole-exome datasets of uveal melanoma and bladder cancer, we demonstrated gNOME's accuracy of variant annotation and the enrichment of loss-of-function variants in known cancer pathways. The gNOME Web server and source code are freely available to the academic community (http://gnome.tchlab.org).
Affiliation(s)
- In-Hee Lee
- Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, Department of Medicine, Boston Children's Hospital, Boston, Massachusetts, 02115
|
38
|
|
39
|
Preeprem T, Gibson G. An association-adjusted consensus deleterious scheme to classify homozygous Mis-sense mutations for personal genome interpretation. BioData Min 2013; 6:24. [PMID: 24365473 PMCID: PMC3892026 DOI: 10.1186/1756-0381-6-24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/08/2013] [Accepted: 12/17/2013] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Personal genome analysis is now being considered for evaluation of disease risk in healthy individuals, utilizing both rare and common variants. Multiple scores have been developed to predict the deleteriousness of amino acid substitutions, using information on the allele frequencies, level of evolutionary conservation, and averaged structural evidence. However, agreement among these scores is limited and they likely over-estimate the fraction of the genome that is deleterious. METHOD This study proposes an integrative approach to identify a subset of homozygous non-synonymous single nucleotide polymorphisms (nsSNPs). An 8-level classification scheme is constructed from the presence/absence of deleterious predictions combined with evidence of association with disease or complex traits. Detailed literature searches and structural validations are then performed for a subset of 826 homozygous mis-sense mutations in 575 proteins found in the genomes of 12 healthy adults. RESULTS Implementation of the Association-Adjusted Consensus Deleterious Scheme (AACDS) classifies 11% of all predicted highly deleterious homozygous variants as most likely to influence disease risk. The number of such variants per genome ranges from 0 to 8 with no significant difference between African and Caucasian Americans. Detailed analysis of mutations affecting the APOE, MTMR2, THSB1, CHIA, αMyHC, and AMY2A proteins shows how the protein structure is likely to be disrupted, even though the associated phenotypes have not been documented in the corresponding individuals. CONCLUSIONS The classification system for homozygous nsSNPs provides an opportunity to systematically rank nsSNPs based on suggestive evidence from annotations and sequence-based predictions. The ranking scheme, in-depth literature searches, and structural validations of highly prioritized mis-sense mutations complement traditional sequence-based approaches and should have particular utility for the development of individualized health profiles. An online tool reporting the AACDS score for any variant is provided at the authors' website.
Affiliation(s)
- Greg Gibson
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
|
40
|
Computational approaches to identify functional genetic variants in cancer genomes. Nat Methods 2013; 10:723-9. [PMID: 23900255 DOI: 10.1038/nmeth.2562] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Received: 01/25/2013] [Accepted: 06/07/2013] [Indexed: 12/13/2022]
Abstract
The International Cancer Genome Consortium (ICGC) aims to catalog genomic abnormalities in tumors from 50 different cancer types. Genome sequencing reveals hundreds to thousands of somatic mutations in each tumor but only a minority of these drive tumor progression. We present the result of discussions within the ICGC on how to address the challenge of identifying mutations that contribute to oncogenesis, tumor maintenance or response to therapy, and recommend computational techniques to annotate somatic variants and predict their impact on cancer phenotype.
|
41
|
Lescai F, Marasco E, Bacchelli C, Stanier P, Mantovani V, Beales P. Identification and validation of loss of function variants in clinical contexts. Mol Genet Genomic Med 2013; 2:58-63. [PMID: 24498629 PMCID: PMC3907911 DOI: 10.1002/mgg3.42] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 07/12/2013] [Accepted: 09/05/2013] [Indexed: 12/20/2022] Open
Abstract
The choice of an appropriate variant calling pipeline for exome sequencing data is becoming increasingly important in translational medicine projects and clinical contexts. Within GOSgene, which facilitates genetic analysis as part of a joint effort of the University College London and the Great Ormond Street Hospital, we aimed to optimize a variant calling pipeline suitable for our clinical context. We implemented the GATK/Queue framework and evaluated the performance of its two callers: the classical UnifiedGenotyper and the new variant discovery tool HaplotypeCaller. We performed an experimental validation of the loss-of-function (LoF) variants called by the two methods using Sequenom technology. UnifiedGenotyper showed a total validation rate of 97.6% for LoF single-nucleotide polymorphisms (SNPs) and 92.0% for insertions or deletions (INDELs), whereas HaplotypeCaller was 91.7% for SNPs and 55.9% for INDELs. We confirm that GATK/Queue is a reliable pipeline in translational medicine and clinical contexts. We conclude that in our working environment, UnifiedGenotyper is the caller of choice, being an accurate method with a high validation rate for error-prone calls such as LoF variants. We finally highlight the importance of experimental validation, especially for INDELs, as part of a standard pipeline in clinical environments.
Affiliation(s)
- Francesco Lescai
- University College London, Institute of Child Health, GOSgene team London, U.K; Department of Biomedicine, Human Genetics, Aarhus University Aarhus, Denmark
- Elena Marasco
- CRBA Centro Ricerca Biomedica Applicata, Azienda Ospedaliero-Universitaria Policlinico S. Orsola - Malpighi Bologna, Italy
- Chiara Bacchelli
- University College London, Institute of Child Health, GOSgene team London, U.K
- Philip Stanier
- University College London, Institute of Child Health, GOSgene team London, U.K
- Vilma Mantovani
- Department of Biomedicine, Human Genetics, Aarhus University Aarhus, Denmark
- Philip Beales
- University College London, Institute of Child Health, GOSgene team London, U.K
|
42
|
Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, Das J, Abyzov A, Balasubramanian S, Beal K, Chakravarty D, Challis D, Chen Y, Clarke D, Clarke L, Cunningham F, Evani US, Flicek P, Fragoza R, Garrison E, Gibbs R, Gümüş ZH, Herrero J, Kitabayashi N, Kong Y, Lage K, Liluashvili V, Lipkin SM, MacArthur DG, Marth G, Muzny D, Pers TH, Ritchie GRS, Rosenfeld JA, Sisu C, Wei X, Wilson M, Xue Y, Yu F, Dermitzakis ET, Yu H, Rubin MA, Tyler-Smith C, Gerstein M. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 2013; 342:1235587. [PMID: 24092746 PMCID: PMC3947637 DOI: 10.1126/science.1235587] [Citation(s) in RCA: 269] [Impact Index Per Article: 24.5] [Indexed: 12/15/2022]
Abstract
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
Affiliation(s)
- Ekta Khurana
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Yao Fu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Vincenza Colonna
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Institute of Genetics and Biophysics, National Research Council (CNR), 80131 Naples, Italy
- Xinmeng Jasmine Mu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Hyun Min Kang
- Center for Statistical Genetics, Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Tuuli Lappalainen
- Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Institute for Genetics and Genomics in Geneva (iGE3), University of Geneva, 1211 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
- Andrea Sboner
- Institute for Precision Medicine and the Department of Pathology and Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian Hospital, New York, NY 10065, USA
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021, USA
- Lucas Lochovsky
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Jieming Chen
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, CT 06520, USA
- Arif Harmanci
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Jishnu Das
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Alexej Abyzov
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Suganthi Balasubramanian
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Kathryn Beal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Dimple Chakravarty
- Institute for Precision Medicine and the Department of Pathology and Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian Hospital, New York, NY 10065, USA
- Daniel Challis
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030, USA
- Yuan Chen
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Declan Clarke
- Department of Chemistry, Yale University, New Haven, CT 06520, USA
- Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Uday S. Evani
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030, USA
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Robert Fragoza
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Erik Garrison
- Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
- Richard Gibbs
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030, USA
- Zeynep H. Gümüş
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021, USA
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, 10065, USA
- Javier Herrero
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Naoki Kitabayashi
- Institute for Precision Medicine and the Department of Pathology and Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian Hospital, New York, NY 10065, USA
- Yong Kong
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Keck Biotechnology Resource Laboratory, Yale University, New Haven, CT 06511, USA
- Kasper Lage
- Pediatric Surgical Research Laboratories, MassGeneral Hospital for Children, Massachusetts General Hospital, Boston, MA 02114, USA
- Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Harvard Medical School, Boston, MA 02115, USA
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
- Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Vaja Liluashvili
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021, USA
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, 10065, USA
- Steven M. Lipkin
- Department of Medicine, Weill Cornell Medical College, New York, NY 10065, USA
- Daniel G. MacArthur
- Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02142, USA
- Gabor Marth
- Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
- Donna Muzny
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030, USA
- Tune H. Pers
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
- Division of Endocrinology and Center for Basic and Translational Obesity Research, Children’s Hospital, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham R. S. Ritchie
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jeffrey A. Rosenfeld
- Department of Medicine, Rutgers New Jersey Medical School, Newark, NJ 07101, USA
- IST/High Performance and Research Computing, Rutgers University Newark, NJ 07101, USA
- Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
- Cristina Sisu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Xiaomu Wei
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Department of Medicine, Weill Cornell Medical College, New York, NY 10065, USA
- Michael Wilson
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Child Study Center, Yale University, New Haven, CT 06520, USA
- Yali Xue
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Fuli Yu
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030, USA
- Emmanouil T. Dermitzakis
- Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Institute for Genetics and Genomics in Geneva (iGE3), University of Geneva, 1211 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
- Haiyuan Yu
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Mark A. Rubin
- Institute for Precision Medicine and the Department of Pathology and Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian Hospital, New York, NY 10065, USA
- Chris Tyler-Smith
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, New Haven, CT 06520, USA
|
43
|
Dorn C, Grunert M, Sperling SR. Application of high-throughput sequencing for studying genomic variations in congenital heart disease. Brief Funct Genomics 2013; 13:51-65. [PMID: 24095982 DOI: 10.1093/bfgp/elt040] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/30/2022] Open
Abstract
Congenital heart diseases (CHD) represent the most common birth defect in humans. The majority of cases are caused by a combination of complex genetic alterations and environmental influences. In the past, many disease-causing mutations have been identified; however, there is still a large proportion of cardiac malformations of unknown precise origin. High-throughput sequencing technologies established in recent years offer novel opportunities to further study the genetic background underlying the disease. In this review, we provide a roadmap for designing and analyzing high-throughput sequencing studies focused on CHD, but also with general applicability to other complex diseases. The three main next-generation sequencing (NGS) platforms, including their particular advantages and disadvantages, are presented. To identify potentially disease-related genomic variations and genes, different filtering steps and gene prioritization strategies are discussed. In addition, available control datasets based on NGS are summarized. Finally, we provide an overview of current studies already using NGS technologies and showing that these techniques will help to further unravel the complex genetics underlying CHD.
Affiliation(s)
- Cornelia Dorn
- Department of Cardiovascular Genetics, Experimental and Clinical Research Center (ECRC), Charité-University Medicine Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Lindenberger Weg 80, 13125 Berlin, Germany. Department of Biochemistry, Free University Berlin, Berlin, Germany. Tel.: +49-(0)30-450540123; Fax: +49-(0)30-84131699;
|
44
|
Tool for rapid annotation of microbial SNPs (TRAMS): a simple program for rapid annotation of genomic variation in prokaryotes. Antonie Van Leeuwenhoek 2013; 104:431-4. [PMID: 23828175 DOI: 10.1007/s10482-013-9953-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 04/05/2013] [Accepted: 06/12/2013] [Indexed: 10/26/2022]
Abstract
Next generation sequencing (NGS) has been widely used to study genomic variation in a variety of prokaryotes. Single nucleotide polymorphisms (SNPs) resulting from genomic comparisons need to be annotated for their functional impact on the coding sequences. We have developed a program, TRAMS, for functional annotation of genomic SNPs, which is available to download as a single-file executable for Windows users with limited computational experience and as a Python script for Mac OS and Linux users. TRAMS needs a tab-delimited text file containing SNP locations, reference nucleotides and SNPs in variant strains, along with a reference genome sequence in GenBank or EMBL format. SNPs are annotated as synonymous, nonsynonymous or nonsense. Nonsynonymous SNPs in start and stop codons are separated as non-start and non-stop SNPs, respectively. SNPs in multiple overlapping features are annotated separately for each feature, and multiple nucleotide polymorphisms within a codon are combined before annotation. We have also developed a workflow for Galaxy, a highly used tool for analysing NGS data, to map short reads to a reference genome and extract and annotate the SNPs. TRAMS is a simple program for rapid and accurate annotation of SNPs that will be very useful for microbiologists in analysing genomic diversity in microbial populations.
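The annotation categories named in this abstract can be illustrated with a minimal Python sketch. It is not TRAMS code: the function name `classify_snp`, its arguments, and the toy sequence are hypothetical, and the sketch assumes a forward-strand coding sequence that begins at the start codon. It also simplifies by treating any amino acid change in the first codon as non-start, even though prokaryotes can use alternative start codons such as GTG or TTG.

```python
from itertools import product

# Codon-to-amino-acid map in the conventional TCAG ordering ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds: str, position: int, alt_base: str) -> str:
    """Classify one SNP in a coding sequence (hypothetical helper).

    cds      -- coding sequence, 5'->3', starting at the start codon
    position -- 0-based index of the substituted base within cds
    alt_base -- the variant nucleotide
    """
    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        return "synonymous"
    if codon_start == 0:
        # Simplification: any change to the first codon's translation.
        return "non-start"
    if ref_aa == "*":
        return "non-stop"
    if alt_aa == "*":
        return "nonsense"
    return "nonsynonymous"


# Toy coding sequence: ATG GCT TGG TAA
toy_cds = "ATGGCTTGGTAA"
for pos, alt in [(5, "C"), (4, "A"), (8, "A"), (9, "C"), (2, "A")]:
    print(pos, alt, classify_snp(toy_cds, pos, alt))
```

Handling of overlapping features, reverse-strand genes, and multiple substitutions within one codon, all of which the abstract mentions, is left out of this sketch.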
|
45
|
Shen H, Li J, Zhang J, Xu C, Jiang Y, Wu Z, Zhao F, Liao L, Chen J, Lin Y, Tian Q, Papasian CJ, Deng HW. Comprehensive characterization of human genome variation by high coverage whole-genome sequencing of forty four Caucasians. PLoS One 2013; 8:e59494. [PMID: 23577066 PMCID: PMC3618277 DOI: 10.1371/journal.pone.0059494] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 10/06/2012] [Accepted: 02/14/2013] [Indexed: 12/14/2022] Open
Abstract
Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, the majority of which (∼96%) have a minor allele frequency of less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely "knock out" the corresponding genes. Across all the 44 genomes, a total of 182 genes were "knocked out" in at least one individual genome, among which 46 genes were "knocked out" in over 30% of our samples, suggesting that a number of genes are commonly "knocked out" in general populations. Gene ontology analysis suggested that these commonly "knocked-out" genes are enriched in biological processes related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.
Affiliation(s)
- Hui Shen
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Jian Li
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Jigang Zhang
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Chao Xu
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, P. R. China
- Yan Jiang
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, P. R. China
- Zikai Wu
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, P. R. China
- Fuping Zhao
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Li Liao
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Jun Chen
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- Yong Lin
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, P. R. China
- Qing Tian
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Christopher J. Papasian
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Hong-Wen Deng
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, P. R. China
|
46
|
Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing. ScientificWorldJournal 2013; 2013:730210. [PMID: 23365548 PMCID: PMC3556895 DOI: 10.1155/2013/730210] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 10/28/2012] [Accepted: 11/22/2012] [Indexed: 12/28/2022] Open
Abstract
It has become increasingly apparent that one of the major hurdles in the genomic age will be the bioinformatics challenges of next-generation sequencing. We provide an overview of a general framework of bioinformatics analysis. For each of the three stages of (1) alignment, (2) variant calling, and (3) filtering and annotation, we describe the analysis required and survey the different software packages that are used. Furthermore, we discuss possible future developments as data sources grow and highlight opportunities for new bioinformatics tools to be developed.
|
47
|
Abstract
As advances in life sciences and information technology bring profound influences on bioinformatics due to its interdisciplinary nature, bioinformatics is experiencing a new leap-forward from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. Albeit relatively new, cloud computing promises to address big data storage and analysis issues in the bioinformatics field. Here we review extant cloud-based services in bioinformatics, classify them into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and present our perspectives on the adoption of cloud computing in bioinformatics. Reviewers: This article was reviewed by Frank Eisenhaber, Igor Zhulin, and Sandor Pongor.
Affiliation(s)
- Lin Dai
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, No.7 Beitucheng West Road, Building G, Chaoyang District, Beijing 100029, China
|
48
|
Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of rare variant association studies. Hum Mol Genet 2012; 21:R1-9. [PMID: 22983955 PMCID: PMC3459641 DOI: 10.1093/hmg/dds387] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Received: 08/21/2012] [Accepted: 09/07/2012] [Indexed: 11/13/2022] Open
Abstract
Genetic association and linkage studies can provide insights into complex disease biology, guiding the development of new diagnostic and therapeutic strategies. Over the past decade, genetic association studies have largely focused on common, easy-to-measure genetic variants shared between many individuals. These common variants typically have subtle functional consequences, and translating the resulting association signals into biological insights can be challenging. In the last few years, exome sequencing has emerged as a cost-effective strategy for extending these studies to include rare coding variants, which often have more marked functional consequences. Here, we provide practical guidance in the design and analysis of complex trait association studies focused on rare, coding variants.
Affiliation(s)
- Ron Do
- Center for Human Genetic Research and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Sekar Kathiresan
- Center for Human Genetic Research and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Gonçalo R. Abecasis
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA
|