1
Pan Y, Karagiannis K, Zhang H, Dingerdissen H, Shamsaddini A, Wan Q, Simonyan V, Mazumder R. Human germline and pan-cancer variomes and their distinct functional profiles. Nucleic Acids Res 2014; 42:11570-88. [PMID: 25232094 PMCID: PMC4191387 DOI: 10.1093/nar/gku772] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 12/27/2022] Open
Abstract
Identification of non-synonymous single nucleotide variations (nsSNVs) has exponentially increased due to advances in Next-Generation Sequencing technologies. The functional impacts of these variations have been difficult to ascertain because the corresponding knowledge about sequence functional sites is quite fragmented. It is clear that mapping of variations to sequence functional features can help us better understand the pathophysiological role of variations. In this study, we investigated the effect of nsSNVs on more than 17 common types of post-translational modification (PTM) sites, active sites and binding sites. Out of 1 705 285 distinct nsSNVs on 259 216 functional sites we identified 38 549 variations that significantly affect 10 major functional sites. Furthermore, we found distinct patterns of site disruptions due to germline and somatic nsSNVs. Pan-cancer analysis across 12 different cancer types led to the identification of 51 genes with 106 nsSNV affected functional sites found in 3 or more cancer types. 13 of the 51 genes overlap with previously identified Significantly Mutated Genes (Nature. 2013 Oct 17;502(7471)). 62 mutations in these 13 genes affecting functional sites such as DNA, ATP binding and various PTM sites occur across several cancers and can be prioritized for additional validation and investigations.
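The mapping step this abstract describes, placing nsSNVs onto annotated functional sites, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the tuple layouts and the toy accessions `PROT_A` and `PROT_B` are all hypothetical, and the statistical significance testing mentioned in the abstract is not covered.

```python
from collections import defaultdict


def map_variants_to_sites(variants, sites):
    """Return the variants whose protein position coincides with an annotated site.

    variants: iterable of (accession, position, ref_aa, alt_aa)
    sites:    iterable of (accession, position, site_type)
    Both layouts are assumptions made for this illustration.
    """
    # Index the annotations by (accession, position) for constant-time lookup.
    site_index = defaultdict(list)
    for accession, position, site_type in sites:
        site_index[(accession, position)].append(site_type)

    hits = []
    for accession, position, ref_aa, alt_aa in variants:
        for site_type in site_index.get((accession, position), []):
            hits.append((accession, position, ref_aa, alt_aa, site_type))
    return hits


# Hypothetical toy data, not taken from the paper.
sites = [
    ("PROT_A", 15, "active site"),
    ("PROT_A", 42, "phosphorylation"),
    ("PROT_B", 7, "ATP binding"),
]
variants = [
    ("PROT_A", 42, "S", "A"),
    ("PROT_A", 50, "G", "R"),
    ("PROT_B", 7, "K", "E"),
]
hits = map_variants_to_sites(variants, sites)
```

Indexing the sites first keeps the lookup cost proportional to the number of variants, which matters when, as here, the variant set is in the millions.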
Affiliation(s)
- Yang Pan
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Konstantinos Karagiannis
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Haichen Zhang
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Hayley Dingerdissen
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Amirhossein Shamsaddini
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Quan Wan
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
- Vahan Simonyan
  - Center for Biologics Evaluation and Research, US Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
- Raja Mazumder
  - The Department of Biochemistry & Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
2
Santana-Quintero L, Dingerdissen H, Thierry-Mieg J, Mazumder R, Simonyan V. HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis. PLoS One 2014; 9:e99033. [PMID: 24918764 PMCID: PMC4053384 DOI: 10.1371/journal.pone.0099033] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 02/27/2014] [Accepted: 05/09/2014] [Indexed: 12/31/2022] Open
Abstract
Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. AVAILABILITY: https://hive.biochemistry.gwu.edu/hive/
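The first component listed in this abstract, non-redundification and sorting of sequences, can be sketched in general terms as follows. This is an assumption about the idea only (collapse identical reads so that each distinct sequence needs to be aligned once, then weight the result by its count); it does not reproduce the HIVE-hexagon implementation, and `non_redundify` is a hypothetical name.

```python
from collections import Counter


def non_redundify(reads):
    """Collapse identical reads into sorted (sequence, count) pairs."""
    counts = Counter(reads)
    # Sorting places similar sequences next to each other, so work done for
    # one sequence could in principle be reused for its neighbours.
    return sorted(counts.items())


# Hypothetical toy reads.
reads = ["ACGT", "TTGA", "ACGT", "ACGA", "TTGA", "ACGT"]
unique_reads = non_redundify(reads)
```

In this toy input six reads reduce to three distinct sequences, so an aligner would run three alignments instead of six and scale each result by the stored count.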
Affiliation(s)
- Luis Santana-Quintero
  - Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
- Hayley Dingerdissen
  - Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
  - Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
- Jean Thierry-Mieg
  - National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Raja Mazumder
  - Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
  - * E-mail: (RM); (VS)
- Vahan Simonyan
  - Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
  - * E-mail: (RM); (VS)
3
Dingerdissen H, Weaver DS, Karp PD, Pan Y, Simonyan V, Mazumder R. A framework for application of metabolic modeling in yeast to predict the effects of nsSNV in human orthologs. Biol Direct 2014; 9:9. [PMID: 24894379 PMCID: PMC4057618 DOI: 10.1186/1745-6150-9-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/24/2014] [Accepted: 05/19/2014] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND We have previously suggested a method for proteome wide analysis of variation at functional residues wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. 34 of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making these proteins ideal candidates for laboratory validation through creation and observation of specific yeast active site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV. RESULTS For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify 6 human proteins with yeast orthologs which were suitable for flux balance analysis (FBA). 5 of these proteins are known to be associated with diseases, including ribose 5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and we suspect the sixth enzyme to have disease associations which are not yet classified or understood based on the work described herein. CONCLUSIONS Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5 phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. 
Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated.
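The reciprocal best BLAST hits method named in the RESULTS of this abstract can be sketched as follows. The sketch assumes the BLAST scores have already been parsed into dictionaries keyed by (query, subject); the function names, identifiers and scores are hypothetical, and the manual alignment and pathway comparison steps the authors describe are not modeled.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of (query, subject) -> score, e.g. parsed BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores; the identifiers are placeholders, not real genes.
human_vs_yeast = {
    ("hA", "yA"): 250.0,
    ("hA", "yB"): 90.0,
    ("hB", "yB"): 120.0,
    ("hC", "yB"): 300.0,
}
yeast_vs_human = {
    ("yA", "hA"): 245.0,
    ("yB", "hC"): 310.0,
    ("yB", "hB"): 110.0,
}
pairs = reciprocal_best_hits(human_vs_yeast, yeast_vs_human)
```

Here `hB` is dropped: its best yeast hit is `yB`, but the best human hit for `yB` is `hC`, so the relationship is not reciprocal.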
Affiliation(s)
- Hayley Dingerdissen
  - Department of Biochemistry and Molecular Biology, The George Washington University Medical Center, Ross Hall, Room 540, 2300 Eye Street NW, Washington, DC 20037, USA
- Daniel S Weaver
  - Bioinformatics Research Group, Artificial Intelligence Center, SRI International Menlo Park, Menlo Park, CA 94025, USA
- Peter D Karp
  - Bioinformatics Research Group, Artificial Intelligence Center, SRI International Menlo Park, Menlo Park, CA 94025, USA
- Yang Pan
  - Department of Biochemistry and Molecular Biology, The George Washington University Medical Center, Ross Hall, Room 540, 2300 Eye Street NW, Washington, DC 20037, USA
- Vahan Simonyan
  - Center for Biologics Evaluation and Research, US Food and Drug Administration, 1451 Rockville Pike, Rockville, MD 20852, USA
- Raja Mazumder
  - Department of Biochemistry and Molecular Biology, The George Washington University Medical Center, Ross Hall, Room 540, 2300 Eye Street NW, Washington, DC 20037, USA
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
4
|
Wu TJ, Shamsaddini A, Pan Y, Smith K, Crichton DJ, Simonyan V, Mazumder R. A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE). Database: The Journal of Biological Databases and Curation 2014; 2014:bau022. [PMID: 24667251 PMCID: PMC3965850 DOI: 10.1093/database/bau022] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Indexed: 01/22/2023]
Abstract
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information along with variation-related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu
Affiliation(s)
- Tsung-Jung Wu
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, DC 20037, USA
  - Data Systems and Technology Jet Propulsion Laboratory 4800 Oak Grove Drive Pasadena, CA 91109
  - Center for Biologics Evaluation and Research, Food and Drug Administration, Rockville, MD 20852, USA
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
5
Cole C, Krampis K, Karagiannis K, Almeida JS, Faison WJ, Motwani M, Wan Q, Golikov A, Pan Y, Simonyan V, Mazumder R. Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data. BMC Bioinformatics 2014; 15:28. [PMID: 24467687 PMCID: PMC3916084 DOI: 10.1186/1471-2105-15-28] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 11/05/2013] [Accepted: 01/22/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome.
Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
Affiliation(s)
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA.
6
Sun HY, Ji FQ, Fu LY, Wang ZY, Zhang HY. Structural and Energetic Analyses of SNPs in Drug Targets and Implications for Drug Therapy. J Chem Inf Model 2013; 53:3343-51. [DOI: 10.1021/ci400457v] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 02/08/2023]
Affiliation(s)
- Hui-Yong Sun
  - National Key Laboratory of Crop Genetic Improvement, Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, P.R. China
  - School of Life Sciences, Shandong University of Technology, Zibo 255049, P.R. China
- Feng-Qin Ji
  - National Key Laboratory of Crop Genetic Improvement, Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, P.R. China
- Liang-Yu Fu
  - National Key Laboratory of Crop Genetic Improvement, Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, P.R. China
- Zhong-Yi Wang
  - National Key Laboratory of Crop Genetic Improvement, Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, P.R. China
- Hong-Yu Zhang
  - National Key Laboratory of Crop Genetic Improvement, Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, P.R. China
7
Horvatovich P, Franke L, Bischoff R. Proteomic studies related to genetic determinants of variability in protein concentrations. J Proteome Res 2013; 13:5-14. [PMID: 24237071 DOI: 10.1021/pr400765y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 12/22/2022]
Abstract
Genetic variation has multiple effects on the proteome. It may influence the expression level of proteins, modify their sequences through single nucleotide polymorphisms, the occurrence of allelic variants, or alternative splicing (ASP) events. This perspective paper summarizes the major effects of genetic variability on protein expression and isoforms and provides an overview of proteomics techniques and methods that allow studying the effects of genetic variability at different levels of the proteome. The paper provides an overview of recent quantitative trait loci studies performed to explore the effect of genetic variation on protein expression (pQTL). Finally it gives a perspective view on advances in proteomics technology and the role of the Chromosome-Centric Human Proteome Project (C-HPP) by creating large-scale resources that may facilitate performing more comprehensive pQTL experiments in the future.
Affiliation(s)
- Péter Horvatovich
  - Analytical Biochemistry, Department of Pharmacy, University of Groningen, A. Deusinglaan 1, 9713 AV Groningen, The Netherlands