1
Pozo F, Rodriguez JM, Martínez Gómez L, Vázquez J, Tress ML. APPRIS principal isoforms and MANE Select transcripts define reference splice variants. Bioinformatics 2022; 38:ii89-ii94. [PMID: 36124785 PMCID: PMC9486585 DOI: 10.1093/bioinformatics/btac473] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/25/2022]
Abstract
MOTIVATION: Selecting the splice variant that best represents a coding gene is a crucial first step in many experimental analyses, and vital for mapping clinically relevant variants. This study compares the longest isoforms, MANE Select transcripts, APPRIS principal isoforms, and expression data, and aims to determine which method is best for selecting biologically important reference splice variants for large-scale analyses. RESULTS: Proteomics analyses and human genetic variation data suggest that most coding genes have a single main protein isoform. We show that APPRIS principal isoforms and MANE Select transcripts best describe these main cellular isoforms, and find that using the longest splice variant as the representative is a poor strategy. Exons unique to the longest splice isoforms are not under selective pressure, and so are unlikely to be functionally relevant. Expression data are also a poor means of selecting the main splice variant. APPRIS principal and MANE Select exons are under purifying selection, while exons specific to alternative transcripts are not. There are MANE and APPRIS representatives for almost 95% of genes, and where they agree they are particularly effective, coinciding with the main proteomics isoform for over 98.2% of genes. AVAILABILITY AND IMPLEMENTATION: APPRIS principal isoforms for human, mouse and other model species can be downloaded from the APPRIS database (https://appris.bioinfo.cnio.es), GENCODE genes (https://www.gencodegenes.org/) and the Ensembl website (https://www.ensembl.org). MANE Select transcripts for the human reference set are available from the Ensembl, GENCODE and RefSeq databases (https://www.ncbi.nlm.nih.gov/refseq/). Lists of splice variants where MANE and APPRIS coincide are available from the APPRIS database. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fernando Pozo
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain
- José Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
- Laura Martínez Gómez
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain
- Jesús Vázquez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
  - CIBER de Investigaciones Cardiovasculares (CIBERCV), 28029 Madrid, Spain
2
Lin CX, Li HD, Deng C, Liu W, Erhardt S, Wu FX, Zhao XM, Guan Y, Wang J, Wang D, Hu B, Wang J. An integrated brain-specific network identifies genes associated with neuropathologic and clinical traits of Alzheimer's disease. Brief Bioinform 2022; 23:bbab522. [PMID: 34953465 PMCID: PMC8769916 DOI: 10.1093/bib/bbab522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Revised: 10/26/2021] [Accepted: 11/13/2021] [Indexed: 09/24/2024]
Abstract
Alzheimer's disease (AD) has a strong genetic predisposition. However, its risk genes remain incompletely identified. We developed an Alzheimer's brain gene network-based approach to predict AD-associated genes by leveraging the functional pattern of known AD-associated genes. Our constructed network outperformed existing networks in predicting AD genes. We then systematically validated the predictions using independent genetic, transcriptomic, proteomic, neuropathological and clinical data. First, top-ranked genes were enriched in AD-associated pathways. Second, using external gene expression data from the Mount Sinai Brain Bank study, we found that the top-ranked genes were significantly associated with neuropathological and clinical traits, including the Consortium to Establish a Registry for Alzheimer's Disease score, Braak stage score and clinical dementia rating. The analysis of Alzheimer's brain single-cell RNA-seq data revealed cell-type-specific association of predicted genes with early pathology of AD. Third, by interrogating proteomic data in the Religious Orders Study and Memory and Aging Project and Baltimore Longitudinal Study of Aging studies, we observed a significant association of protein expression level with cognitive function and AD clinical severity. The network, method and predictions could become a valuable resource to advance the identification of risk genes for AD.
Affiliation(s)
- Cui-Xiang Lin
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China
  - Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, Hunan 410083, P. R. China
- Hong-Dong Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China
  - Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, Hunan 410083, P. R. China
- Chao Deng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China
  - Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, Hunan 410083, P. R. China
- Weisheng Liu
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China
  - Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, Hunan 410083, P. R. China
- Shannon Erhardt
  - Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States
- Jun Wang
  - Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Daifeng Wang
  - Department of Biostatistics and Medical Informatics and Waisman Center, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Bin Hu
  - Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China
  - Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, Hunan 410083, P. R. China
3
Rodriguez JM, Pozo F, Cerdán-Vélez D, Di Domenico T, Vázquez J, Tress M. APPRIS: selecting functionally important isoforms. Nucleic Acids Res 2022; 50:D54-D59. [PMID: 34755885 PMCID: PMC8728124 DOI: 10.1093/nar/gkab1058] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 09/15/2021] [Revised: 10/14/2021] [Accepted: 10/20/2021] [Indexed: 12/20/2022]
Abstract
APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.
Affiliation(s)
- Jose Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
- Fernando Pozo
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Daniel Cerdán-Vélez
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Tomás Di Domenico
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Jesús Vázquez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
  - CIBER de Enfermedades Cardiovasculares (CIBERCV), 28029 Madrid, Spain
- Michael L Tress
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
4
Lin CX, Li HD, Deng C, Guan Y, Wang J. TissueNexus: a database of human tissue functional gene networks built with a large compendium of curated RNA-seq data. Nucleic Acids Res 2021; 50:D710-D718. [PMID: 34850130 PMCID: PMC8728275 DOI: 10.1093/nar/gkab1133] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/12/2021] [Revised: 10/10/2021] [Accepted: 11/18/2021] [Indexed: 01/02/2023]
Abstract
Mapping gene interactions within tissues/cell types plays a crucial role in understanding the genetic basis of human physiology and disease. Tissue functional gene networks (FGNs) are essential models for mapping complex gene interactions. We present TissueNexus, a database of 49 human tissue/cell line FGNs constructed by integrating heterogeneous genomic data. We adopted an advanced machine learning approach for data integration because Bayesian classifiers, which are the main approach used for constructing existing tissue gene networks, cannot capture the interaction and nonlinearity of genomic features well. A total of 1,341 RNA-seq datasets containing 52,087 samples were integrated for all of these networks. Because the tissue label for RNA-seq data may be annotated with different names or be missing, we performed intensive hand-curation to improve quality. We further developed a user-friendly database for network search, visualization, and functional analysis. We illustrate the application of TissueNexus in prioritizing disease genes. The database is publicly available at https://www.diseaselinks.com/TissueNexus/.
Affiliation(s)
- Cui-Xiang Lin
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, Hunan 410083, P.R. China
- Hong-Dong Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, Hunan 410083, P.R. China
- Chao Deng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, Hunan 410083, P.R. China
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, Hunan 410083, P.R. China
5
Li HD, Xu Y, Zhu X, Liu Q, Omenn GS, Wang J. ClusterMine: A knowledge-integrated clustering approach based on expression profiles of gene sets. J Bioinform Comput Biol 2021; 18:2040009. [PMID: 32698720 DOI: 10.1142/s0219720020400090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/01/2023]
Abstract
Clustering analysis of gene expression data is essential for understanding complex biological data, and is widely used in important biological applications such as the identification of cell subpopulations and disease subtypes. In commonly used methods such as hierarchical clustering (HC) and consensus clustering (CC), holistic expression profiles of all genes are often used to assess the similarity between samples for clustering. While these methods have proven successful in identifying sample clusters in many areas, they do not provide information about which gene sets (functions) contribute most to the clustering, thus limiting the interpretability of the resulting clusters. We hypothesize that integrating prior knowledge of annotated gene sets would not only achieve satisfactory clustering performance but also, more importantly, enable potential biological interpretation of clusters. Here we report ClusterMine, an approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets in functional annotation databases such as Gene Ontology. In addition to the cluster membership of each sample as provided by conventional approaches, it also outputs gene sets that most likely contribute to the clustering, thus facilitating biological interpretation. We compare ClusterMine with conventional approaches on nine real-world experimental datasets that represent different application scenarios in biology. We find that ClusterMine achieves better performance and that the gene sets prioritized by our method are biologically meaningful. ClusterMine is implemented as an R package and is freely available at: www.genemine.org/clustermine.php.
Affiliation(s)
- Hong-Dong Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
- Yunpei Xu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
- Xiaoshu Zhu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
  - School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi, P. R. China
- Quan Liu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
- Gilbert S Omenn
  - Departments of Computational Medicine and Bioinformatics, Internal Medicine, Human Genetics and School of Public Health, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
6
Li H, Funk CC, McFarland K, Dammer EB, Allen M, Carrasquillo MM, Levites Y, Chakrabarty P, Burgess JD, Wang X, Dickson D, Seyfried NT, Duong DM, Lah JJ, Younkin SG, Levey AI, Omenn GS, Ertekin-Taner N, Golde TE, Price ND. Integrative functional genomic analysis of intron retention in human and mouse brain with Alzheimer's disease. Alzheimers Dement 2021; 17:984-1004. [PMID: 33480174 PMCID: PMC8248162 DOI: 10.1002/alz.12254] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 11/14/2019] [Revised: 10/08/2020] [Accepted: 10/17/2020] [Indexed: 12/21/2022]
Abstract
Intron retention (IR) has been implicated in the pathogenesis of complex diseases such as cancers; its association with Alzheimer's disease (AD) remains unexplored. We performed genome-wide analysis of IR through integrating genetic, transcriptomic, and proteomic data of AD subjects and mouse models from the Accelerating Medicines Partnership-Alzheimer's Disease project. We identified 4535 and 4086 IR events in 2173 human and 1736 mouse genes, respectively. Quantitation of IR enabled the identification of differentially expressed genes that conventional exon-level approaches did not reveal. There were significant correlations of intron expression within innate immune genes, like HMBOX1, with AD in humans. Peptides with a high probability of translation from intron-retained mRNAs were identified using mass spectrometry. Further, we established AD-specific intron expression Quantitative Trait Loci, and identified splicing-related genes that may regulate IR. Our analysis provides a novel resource for the search for new AD biomarkers and pathological mechanisms.
Affiliation(s)
- Hong-Dong Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, P.R. China
  - Institute for Systems Biology, Seattle, Washington, USA
- Cory C. Funk
  - Institute for Systems Biology, Seattle, Washington, USA
- Karen McFarland
  - Department of Neuroscience and Neurology, Center for Translational Research in Neurodegenerative Disease, and McKnight Brain Institute, University of Florida, Gainesville, Florida, USA
- Eric B. Dammer
  - Department of Biochemistry, Emory University, Atlanta, Georgia, USA
- Mariet Allen
  - Mayo Clinic, Department of Neuroscience, Jacksonville, Florida, USA
- Yona Levites
  - Department of Neuroscience and Neurology, Center for Translational Research in Neurodegenerative Disease, and McKnight Brain Institute, University of Florida, Gainesville, Florida, USA
- Paramita Chakrabarty
  - Department of Neuroscience and Neurology, Center for Translational Research in Neurodegenerative Disease, and McKnight Brain Institute, University of Florida, Gainesville, Florida, USA
- Xue Wang
  - Mayo Clinic, Department of Health Sciences Research, Jacksonville, Florida, USA
- Dennis Dickson
  - Mayo Clinic, Department of Neuroscience, Jacksonville, Florida, USA
- Nicholas T. Seyfried
  - Department of Biochemistry, Emory University, Atlanta, Georgia, USA
  - Department of Neurology, Emory University, Atlanta, Georgia, USA
- Duc M. Duong
  - Department of Biochemistry, Emory University, Atlanta, Georgia, USA
- James J. Lah
  - Department of Neurology, Emory University, Atlanta, Georgia, USA
- Allan I. Levey
  - Department of Neurology, Emory University, Atlanta, Georgia, USA
- Gilbert S. Omenn
  - Institute for Systems Biology, Seattle, Washington, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Nilüfer Ertekin-Taner
  - Mayo Clinic, Department of Neuroscience, Jacksonville, Florida, USA
  - Mayo Clinic, Department of Neurology, Jacksonville, Florida, USA
- Todd E. Golde
  - Department of Neuroscience and Neurology, Center for Translational Research in Neurodegenerative Disease, and McKnight Brain Institute, University of Florida, Gainesville, Florida, USA
7
Chen H, Shaw D, Bu D, Jiang T. FINER: enhancing the prediction of tissue-specific functions of isoforms by refining isoform interaction networks. NAR Genom Bioinform 2021; 3:lqab057. [PMID: 34169280 PMCID: PMC8219044 DOI: 10.1093/nargab/lqab057] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/04/2021] [Revised: 05/18/2021] [Accepted: 06/03/2021] [Indexed: 12/24/2022]
Abstract
Annotating the functions of gene products is a mainstay in biology. A variety of databases have been established to record functional knowledge at the gene level. However, functional annotations at the isoform resolution are in great demand in many biological applications. Although critical information in biological processes such as protein-protein interactions (PPIs) is often used to study gene functions, it does not directly help differentiate the functions of isoforms, as the 'proteins' in the existing PPIs generally refer to 'genes'. On the other hand, the prediction of isoform functions and prediction of isoform-isoform interactions, though inherently intertwined, have so far been treated as independent computational problems in the literature. Here, we present FINER, a unified framework to jointly predict isoform functions and refine PPIs from the gene level to the isoform level, enabling both tasks to benefit from each other. Extensive computational experiments on human tissue-specific data demonstrate that FINER is able to gain at least 5.16% in AUC and 15.1% in AUPRC for functional prediction across multiple tissues by refining noisy PPIs, resulting in significant improvement over the state-of-the-art methods. Some in-depth analyses reveal consistency between FINER's predictions and the tissue specificity as well as subcellular localization of isoforms.
Affiliation(s)
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Dipan Shaw
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Dongbo Bu
  - Key Lab of Intelligent Information Process, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Tao Jiang
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
  - Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
8
Pozo F, Martinez-Gomez L, Walsh TA, Rodriguez JM, Di Domenico T, Abascal F, Vazquez J, Tress ML. Assessing the functional relevance of splice isoforms. NAR Genom Bioinform 2021; 3:lqab044. [PMID: 34046593 PMCID: PMC8140736 DOI: 10.1093/nargab/lqab044] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 11/27/2020] [Revised: 04/22/2021] [Accepted: 05/17/2021] [Indexed: 12/20/2022]
Abstract
Alternative splicing of messenger RNA can generate an array of mature transcripts, but it is not clear how many go on to produce functionally relevant protein isoforms. There is only limited evidence for alternative proteins in proteomics analyses, and data from population genetic variation studies indicate that most alternative exons are evolving neutrally. Determining which transcripts produce biologically important isoforms is key to understanding isoform function and to interpreting the real impact of somatic mutations and germline variations. Here we have developed a method, TRIFID, to classify the functional importance of splice isoforms. TRIFID was trained on isoforms detected in large-scale proteomics analyses and distinguishes these biologically important splice isoforms with high confidence. Isoforms predicted as functionally important by the algorithm had measurable cross-species conservation and significantly fewer broken functional domains. Additionally, exons that code for these functionally important protein isoforms are under purifying selection, while exons from low-scoring transcripts largely appear to be evolving neutrally. TRIFID has been developed for the human genome, but it could in principle be applied to other well-annotated species. We believe that this method will generate valuable insights into the cellular importance of alternative splicing.
Affiliation(s)
- Fernando Pozo
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Laura Martinez-Gomez
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Thomas A Walsh
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- José Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
- Tomas Di Domenico
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Federico Abascal
  - Somatic Evolution Group, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- Jesús Vazquez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
- Michael L Tress
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
9
Comprehensive expression-based isoform biomarkers predictive of drug responses based on isoform co-expression networks and clinical data. Genomics 2020; 112:647-658. [DOI: 10.1016/j.ygeno.2019.04.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/11/2018] [Revised: 03/21/2019] [Accepted: 04/23/2019] [Indexed: 11/19/2022]
10
Li H, Siddiqui O, Zhang H, Guan Y. Joint learning improves protein abundance prediction in cancers. BMC Biol 2019; 17:107. [PMID: 31870366 PMCID: PMC6929375 DOI: 10.1186/s12915-019-0730-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/15/2019] [Accepted: 12/04/2019] [Indexed: 12/20/2022]
Abstract
BACKGROUND: The classic central dogma in biology is the information flow from DNA to mRNA to protein, yet complicated regulatory mechanisms underlying protein translation often lead to weak correlations between mRNA and protein abundances. This is particularly the case in cancer samples and when evaluating the same gene across multiple samples. RESULTS: Here, we report a method for predicting proteome from transcriptome, using a training dataset provided by NCI-CPTAC and TCGA, consisting of transcriptome and proteome data from 77 breast and 105 ovarian cancer samples. First, we establish a generic model capturing the correlation between mRNA and protein abundance of a single gene. Second, we build a gene-specific model capturing the interdependencies among multiple genes in a regulatory network. Third, we create a cross-tissue model by jointly learning the information of shared regulatory networks and pathways across cancer tissues. Our method ranked first in the NCI-CPTAC DREAM Proteogenomics Challenge, and the predictive performance is close to the accuracy of experimental replicates. Key functional pathways and network modules controlling the proteomic abundance in cancers were revealed, in particular metabolism-related genes. CONCLUSIONS: We present a method to predict proteome from transcriptome, leveraging data from different cancer tissues to build a trans-tissue model, and suggest how to integrate information from multiple cancers to provide a foundation for further research.
Affiliation(s)
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Omer Siddiqui
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Hongjiu Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
  - Department of Internal Medicine, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
11
Abstract
Alternative splicing produces multiple mRNA isoforms of genes, which have important and diverse roles such as regulation of gene expression, human heritable diseases, and response to environmental stresses. However, little has been done to assign functions at the mRNA isoform level. Functional networks, where the interactions are quantified by their probability of being involved in the same biological process, are typically generated at the gene level. We use a diverse array of tissue-specific RNA-seq datasets and sequence information to train random forest models that predict the functional networks. Since there is no mRNA isoform-level gold standard, we use single-isoform genes co-annotated to Gene Ontology biological process annotations, Kyoto Encyclopedia of Genes and Genomes pathways, BioCyc pathways and protein-protein interactions as functionally related (positive pairs). To generate the non-functional pairs (negative pairs), we use the Gene Ontology annotations tagged with the "NOT" qualifier. We describe 17 Tissue-spEcific mrNa iSoform functIOnal Networks (TENSION) following a leave-one-tissue-out strategy, in addition to an organism-level reference functional network for mouse. We validate our predictions by comparing their performance with previous methods, randomized positive and negative class labels, updated Gene Ontology annotations, and by literature evidence. We demonstrate the ability of our networks to reveal tissue-specific functional differences of the isoforms of the same genes. All scripts and data from TENSION are available at: https://doi.org/10.25380/iastate.c.4275191.
Affiliation(s)
- Gaurav Kandoi
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA
- Julie A Dickerson
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA
12
Rodriguez JM, Rodriguez-Rivas J, Di Domenico T, Vázquez J, Valencia A, Tress ML. APPRIS 2017: principal isoforms for multiple gene sets. Nucleic Acids Res 2019; 46:D213-D217. [PMID: 29069475 PMCID: PMC5753224 DOI: 10.1093/nar/gkx997] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 09/15/2017] [Accepted: 10/19/2017] [Indexed: 01/23/2023]
Abstract
The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein coding genes and APPRIS principal isoforms are the best predictors of these main proteins isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melangaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants.
Affiliation(s)
- Jose Manuel Rodriguez
- Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| | - Juan Rodriguez-Rivas
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| | - Tomás Di Domenico
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| | - Jesús Vázquez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain; CIBER de Enfermedades Cardiovasculares (CIBERCV), 28029 Madrid, Spain
| | - Alfonso Valencia
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona E-08010, Spain; Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona E-08034, Spain
| | - Michael L Tress
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
|
13
|
Ma J, Wang J, Ghoraie LS, Men X, Haibe-Kains B, Dai P. Network-based approach to identify principal isoforms among four cancer types. Mol Omics 2019; 15:117-129. [PMID: 30720033 DOI: 10.1039/c8mo00234g] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Abstract
Protein isoforms are structurally similar proteins produced by alternative splicing of a single gene or genes from the same family. Isoforms of a protein can perform the same, similar, or even opposite biological functions. A previous study identified principal isoforms of proteins based on the extent of interactions per isoform in a functional relationship network, focusing on data from normal tissues. Additionally, the expression levels of specific isoforms of various genes associated with tumorigenesis and prognosis are frequently altered in tumors compared with those in normal tissues. In this study, we aimed to identify higher degree isoforms (HDIs) of multi-isoform genes (MIGs) in cancer by applying a meta-analytical framework to calculate co-expression between each pair of isoforms in two large datasets of RNA-seq profiles from breast cancer, lung cancer, leukemia, and colon cancer cell lines. Then, we compared HDIs with isoforms identified by proteomic data and prognostic and predictive evidence in various cancers. In addition, we separately analyzed the associations between HDIs and non-HDIs (nHDIs) of the same genes according to transcript expression and drug responses in various cancer type cell lines. Collectively, these results indicated the complex properties of HDIs per gene identified by cancer type-based isoform-isoform co-expression networks and showed the potential of HDIs as novel therapeutic targets for cancer treatment.
Affiliation(s)
- Jun Ma
- National Engineering Research Center for Miniaturized Detection Systems, Northwest University, Xi'an, P. R. China, and Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Jenny Wang
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Laleh Soltan Ghoraie
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Xin Men
- Microbiology Institute of Shaanxi, China and National Engineering Research Center for Miniaturized Detection Systems, Northwest University, Xi'an, P. R. China.
| | - Benjamin Haibe-Kains
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Penggao Dai
- National Engineering Research Center for Miniaturized Detection Systems, Northwest University, Xi'an, P. R. China.
|
14
|
Guan Y, Li T, Zhang H, Zhu F, Omenn GS. Prioritizing predictive biomarkers for gene essentiality in cancer cells with mRNA expression data and DNA copy number profile. Bioinformatics 2018; 34:3975-3982. [PMID: 29912344 PMCID: PMC6247930 DOI: 10.1093/bioinformatics/bty467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/26/2017] [Revised: 05/15/2018] [Accepted: 06/12/2018] [Indexed: 01/02/2023]
Abstract
Motivation: Finding driver genes that are responsible for the aberrant proliferation rate of cancer cells is informative for both cancer research and the development of targeted drugs. The established experimental and computational methods are labor-intensive. To make algorithms feasible in real clinical settings, methods that can predict driver genes using less experimental data are urgently needed. Results: We designed an effective feature selection method and used Support Vector Machines (SVM) to predict the essentiality of the potential driver genes in cancer cell lines with only 10 genes as features. The accuracy of our predictions was the highest in the Broad-DREAM Gene Essentiality Prediction Challenge. We also found a set of genes whose essentiality could be predicted much more accurately than others, which we called Accurately Predicted (AP) genes. Our method can serve as a new way of assessing the essentiality of genes in cancer cells. Availability and implementation: The raw data that support the findings of this study are available at Synapse (https://www.synapse.org/#!Synapse:syn2384331/wiki/62825). Source code is available at GitHub (https://github.com/GuanLab/DREAM-Gene-Essentiality-Challenge). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Tingyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Hongjiu Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Fan Zhu
- Key Laboratory of Big Data and Intelligent Computing, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Departments of Internal Medicine and Human Genetics and School of Public Health, University of Michigan, Ann Arbor, MI, USA
|
15
|
Jeong SK, Kim CY, Paik YK. ASV-ID, a Proteogenomic Workflow To Predict Candidate Protein Isoforms on the Basis of Transcript Evidence. J Proteome Res 2018; 17:4235-4242. [PMID: 30289715 DOI: 10.1021/acs.jproteome.8b00548] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/30/2022]
Abstract
One of the goals of the Chromosome-Centric Human Proteome Project (C-HPP) is to map and characterize the functions of protein isoforms produced by alternative splicing of genes. However, identifying alternative splice variants (ASVs) via mass spectrometry remains a major challenge, because ASVs usually contain highly homologous peptide sequences. A routine protein sequence analysis suggests that more than half of the investigated proteins do not generate two or more uniquely mapping peptides that would enable their isoforms to be distinguished. Here, we develop a new proteogenomics method, named "ASV-ID" (alternative splicing variants identification), which enables identification of ASVs by using a cell type-specific protein sequence database that is supported by RNA-Seq data. Using this workflow, we identify 1935 distinct proteins under highly stringent conditions. In fact, transcript evidence on these 841 proteins helps us distinguish them from other isoforms, despite the fact that these proteins are not predicted to make 2 or more uniquely mapping peptides. We also demonstrate that ASV-ID enables detection of 19 differently expressed isoforms present in several cell lines. Thus, a new workflow using ASV-ID has the potential to map yet-to-be-identified difficult protein isoforms in a simple and robust way.
|
16
|
Annotation of Alternatively Spliced Proteins and Transcripts with Protein-Folding Algorithms and Isoform-Level Functional Networks. Methods Mol Biol 2017; 1558:415-436. [PMID: 28150250 DOI: 10.1007/978-1-4939-6783-4_20] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
Abstract
Tens of thousands of splice isoforms of proteins have been catalogued as predicted sequences from transcripts in humans and other species. Relatively few have been characterized biochemically or structurally. With the extensive development of protein bioinformatics, the characterization and modeling of isoform features, isoform functions, and isoform-level networks have advanced notably. Here we present applications of the I-TASSER family of algorithms for folding and functional predictions and the IsoFunc, MIsoMine, and Hisonet data resources for isoform-level analyses of network and pathway-based functional predictions and protein-protein interactions. Hopefully, predictions and insights from protein bioinformatics will stimulate many experimental validation studies.
|
17
|
Omenn GS, Lane L, Lundberg EK, Beavis RC, Overall CM, Deutsch EW. Metrics for the Human Proteome Project 2016: Progress on Identifying and Characterizing the Human Proteome, Including Post-Translational Modifications. J Proteome Res 2016; 15:3951-3960. [PMID: 27487407 PMCID: PMC5129622 DOI: 10.1021/acs.jproteome.6b00511] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.9] [Indexed: 02/03/2023]
Abstract
The HUPO Human Proteome Project (HPP) has two overall goals: (1) stepwise completion of the protein parts list, the draft human proteome, including confidently identifying and characterizing at least one protein product from each protein-coding gene, with increasing emphasis on sequence variants, post-translational modifications (PTMs), and splice isoforms of those proteins; and (2) making proteomics an integrated counterpart to genomics throughout the biomedical and life sciences community. PeptideAtlas and GPMDB reanalyze all major human mass spectrometry data sets available through ProteomeXchange with standardized protocols and stringent quality filters; neXtProt curates and integrates mass spectrometry and other findings to present the most up-to-date authoritative compendium of the human proteome. The HPP Guidelines for Mass Spectrometry Data Interpretation version 2.1 were applied to manuscripts submitted for this 2016 C-HPP-led special issue [www.thehpp.org/guidelines]. The Human Proteome presented as neXtProt version 2016-02 has 16,518 confident protein identifications (Protein Existence [PE] Level 1), up from 13,664 at 2012-12, 15,646 at 2013-09, and 16,491 at 2014-10. There are 485 proteins that would have been PE1 under the Guidelines v1.0 from 2012 but now have insufficient evidence due to the agreed-upon more stringent Guidelines v2.0 to reduce false positives. neXtProt and PeptideAtlas now both require two non-nested, uniquely mapping (proteotypic) peptides of at least 9 aa in length. There are 2,949 missing proteins (PE2+3+4) as the baseline for submissions for this fourth annual C-HPP special issue of Journal of Proteome Research. PeptideAtlas has 14,629 canonical (plus 1,187 uncertain and 1,755 redundant) entries. GPMDB has 16,190 EC4 entries, and the Human Protein Atlas has 10,475 entries with supportive evidence. neXtProt, PeptideAtlas, and GPMDB are rich resources of information about post-translational modifications (PTMs), single amino acid variants (SAAVs), and splice isoforms. Meanwhile, the Biology- and Disease-driven (B/D)-HPP has created comprehensive SRM resources, generated popular protein lists to guide targeted proteomics assays for specific diseases, and launched an Early Career Researchers initiative.
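The two-peptide requirement described in this abstract can be illustrated with a short sketch. It is only a toy reading of the rule, under stated assumptions: peptides and proteins are plain amino-acid strings, "uniquely mapping" is taken to mean that a peptide occurs in exactly one protein of the supplied dictionary, and "non-nested" is taken to mean that neither peptide is a substring of the other. The function names and sequences are hypothetical and do not come from the HPP guidelines, neXtProt, or PeptideAtlas.

```python
from itertools import combinations

MIN_LENGTH = 9  # minimum peptide length in amino acids, as stated in the abstract


def maps_uniquely(peptide, proteins, target):
    """Return True if the peptide occurs in the target protein and in no other."""
    hits = [name for name, seq in proteins.items() if peptide in seq]
    return hits == [target]


def is_nested(a, b):
    """Assumed definition: one peptide is fully contained in the other."""
    return a in b or b in a


def meets_two_peptide_rule(peptides, proteins, target):
    """Check for at least two non-nested, uniquely mapping peptides of >= 9 aa."""
    candidates = [
        p for p in set(peptides)
        if len(p) >= MIN_LENGTH and maps_uniquely(p, proteins, target)
    ]
    return any(not is_nested(a, b) for a, b in combinations(candidates, 2))


# Hypothetical toy sequences: P1 and P2 share an N-terminal stretch.
PROTEINS = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "P2": "MKTAYIAKQRAAAAAAAAAAGGGGGGGGGG",
}

if __name__ == "__main__":
    print(meets_two_peptide_rule(["QISFVKSHFSR", "QLEERLGLIEVQ"], PROTEINS, "P1"))
    print(meets_two_peptide_rule(["QISFVKSHFSR", "ISFVKSHFS"], PROTEINS, "P1"))
    print(meets_two_peptide_rule(["MKTAYIAKQR", "QISFVKSHFSR"], PROTEINS, "P1"))
```

In this toy setting, the first call has two independent unique peptides, the second pair is nested, and the third includes a peptide shared with P2, so only the first satisfies the rule as sketched here.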
Affiliation(s)
- Gilbert S. Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, Michigan 48109-2218, United States
| | - Lydie Lane
- CALIPHO Group, SIB Swiss Institute of Bioinformatics and Department of Human Protein Science, University of Geneva, CMU, Michel-Servet 1, 1211 Geneva 4, Switzerland
| | - Emma K. Lundberg
- SciLifeLab Stockholm and School of Biotechnology, KTH, Karolinska Institutet Science Park, Tomtebodavägen 23, SE-171 65 Solna, Sweden
| | - Ronald C. Beavis
- Biochemistry & Medical Genetics, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
| | - Christopher M. Overall
- Biochemistry and Molecular Biology, and Oral Biological and Medical Sciences University of British Columbia, 2350 Health Sciences Mall, Room 4.401, Vancouver, BC V6T 1Z3, Canada
| | - Eric W. Deutsch
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, Washington 98109-5263, United States
|
18
|
Tress ML, Abascal F, Valencia A. Alternative Splicing May Not Be the Key to Proteome Complexity. Trends Biochem Sci 2017; 42:98-110. [PMID: 27712956 DOI: 10.1016/j.tibs.2016.08.008] [Citation(s) in RCA: 229] [Impact Index Per Article: 28.6] [Received: 03/16/2016] [Revised: 05/19/2016] [Accepted: 08/15/2016] [Indexed: 12/21/2022]
Abstract
Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.
Affiliation(s)
- Michael L Tress
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain
| | - Federico Abascal
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain; Human Genetics Department, Sandhu Group, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Alfonso Valencia
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain; National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain.
|
19
|
Büchler R, Wendler S, Muckova P, Großkreutz J, Rhode H. The intricacy of biomarker complexity - the identification of a genuine proteomic biomarker is more complicated than believed. Proteomics Clin Appl 2016; 10:1073-1076. [PMID: 27377180 DOI: 10.1002/prca.201600067] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 04/21/2016] [Revised: 06/22/2016] [Accepted: 06/29/2016] [Indexed: 11/10/2022]
Abstract
Several reasons have been put forward to explain the irreproducibility of proteomic biomarker search. However, these reasons pertain to almost every part of biomarker search across the entire analytical workflow but are entirely experimental or methodological. However, in this article we point out that there is a further cause of such irreproducibility. This is not an additional methodological or experimental cause but arises directly from the biology of protein expression. It arises from the fact that disease changes the diversity within protein families. This cause of irreproducibility has been very little studied in relation to proteomic biomarker search. Gene expression is highly variable even in healthy people. Therefore, multiple proteoforms are also to be expected when gene expression is disrupted by disease, proteoforms that may be differently altered by pathology. In consequence, it is illogical to expect that the whole protein family produces a reliably usable biomarker. It is more reasonable to expect that a specific proteoform fulfills this role. Appropriate sample pre-fractionation methods and data analyses could help to identify this version, carrying the modification or the epitope required.
Affiliation(s)
- Rita Büchler
- Institute of Biochemistry I, University Hospital Jena, Jena, Germany
| | - Sindy Wendler
- Institute of Biochemistry I, University Hospital Jena, Jena, Germany
| | - Petra Muckova
- Institute of Biochemistry I, University Hospital Jena, Jena, Germany; Clinic of Neurology, University Hospital Jena, Jena, Germany
| | - Heidrun Rhode
- Institute of Biochemistry I, University Hospital Jena, Jena, Germany
|
20
|
Paik YK, Omenn GS, Overall CM, Deutsch EW, Hancock WS. Recent Advances in the Chromosome-Centric Human Proteome Project: Missing Proteins in the Spot Light. J Proteome Res 2015; 14:3409-14. [PMID: 26337862 DOI: 10.1021/acs.jproteome.5b00785] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022]
Affiliation(s)
- Young-Ki Paik
- Yonsei Proteome Research Center, Yonsei University, Seoul 120-749, Korea
| | - Gilbert S Omenn
- Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States; Yonsei Proteome Research Center, Yonsei University, Seoul 120-749, Korea
| | - Christopher M Overall
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, British Columbia V6T 1Z3, Canada; Yonsei Proteome Research Center, Yonsei University, Seoul 120-749, Korea
| | - Eric W Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States; Yonsei Proteome Research Center, Yonsei University, Seoul 120-749, Korea
| | - William S Hancock
- Department of Chemical Biology, Northeastern University, Boston, Massachusetts 02115, United States; Yonsei Proteome Research Center, Yonsei University, Seoul 120-749, Korea
|
21
|
Tavares R, Wajnberg G, Scherer NDM, Pauletti BA, Cassoli JS, Ferreira CG, Paes Leme AF, de Araujo-Souza PS, Martins-de-Souza D, Passetti F. Unveiling alternative splice diversity from human oligodendrocyte proteome data. J Proteomics 2017; 151:293-301. [PMID: 27222040 DOI: 10.1016/j.jprot.2016.05.023] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 10/13/2015] [Revised: 05/14/2016] [Accepted: 05/20/2016] [Indexed: 10/21/2022]
Abstract
Oligodendrocytes produce and maintain the myelin sheath of axons in the central nervous system. Because misassembled myelin sheaths have been associated with brain disorders such as multiple sclerosis and schizophrenia, recent advances have been made towards the description of the oligodendrocyte proteome. The identification of splice variants represented in the proteome is as important as determining the level of oligodendrocyte-associated proteins. Here, we used an oligodendrocyte proteome dataset deposited in ProteomeXchange to search against a customized protein sequence file containing computationally predicted splice variants. Our approach resulted in the identification of 39 splice variants, including one variant from the GTPase KRAS gene and another from the human glutaminase gene family. We also detected the mRNA expression of five selected splice variants and demonstrated that a fraction of these have their canonical proteins participating in direct protein-protein interactions. In conclusion, we believe our findings contribute to the molecular characterization of oligodendrocytes and may encourage other research groups working with central nervous system disorders to investigate the biological significance of these splice variants. The splice variants identified in this study may encode proteins that could be targeted in novel treatment strategies and diagnostic methods. Significance: Several disorders of the central nervous system (CNS) are associated with misassembled myelin sheaths, which are produced and maintained by oligodendrocytes (OL). Recently, the OL proteome has been explored to identify key proteins and molecular functions associated with CNS disorders. We developed an innovative approach to select, with a higher level of confidence, a relevant list of splice variants from a proteome dataset and detected the mRNA expression of five selected variants: EEF1D, KRAS, MFF, SDR39U1, and SUGT1. We also described splice variants extracted from OL proteome data. Among the splice variants identified, some are from genes previously linked to CNS and related disorders. Our findings may contribute to oligodendrocyte characterization and encourage other research groups to investigate the biological role of splice variants and to improve current treatments and diagnostic methods for CNS disorders.
Affiliation(s)
- Raphael Tavares
- Laboratory of Functional Genomics and Bioinformatics, Oswaldo Cruz Institute, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, RJ, Brazil; Bioinformatics Unit, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil
| | - Gabriel Wajnberg
- Laboratory of Functional Genomics and Bioinformatics, Oswaldo Cruz Institute, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, RJ, Brazil; Bioinformatics Unit, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil
| | - Nicole de Miranda Scherer
- Bioinformatics Unit, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil
| | - Bianca Alves Pauletti
- Laboratório de Espectrometria de Massas, Laboratório Nacional de Biociências (LNBio), CNPEM, Campinas, SP, Brazil
| | - Juliana S Cassoli
- Laboratory of Neuroproteomics, Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas (UNICAMP), Campinas, SP, Brazil
| | - Carlos Gil Ferreira
- Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil
| | - Adriana Franco Paes Leme
- Laboratório de Espectrometria de Massas, Laboratório Nacional de Biociências (LNBio), CNPEM, Campinas, SP, Brazil
| | - Patricia Savio de Araujo-Souza
- Department of Immunobiology, Fluminense Federal University (UFF), Niterói, RJ, Brazil; Program of Cellular Biology, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil
| | - Daniel Martins-de-Souza
- Laboratory of Neuroproteomics, Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas (UNICAMP), Campinas, SP, Brazil
| | - Fabio Passetti
- Laboratory of Functional Genomics and Bioinformatics, Oswaldo Cruz Institute, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, RJ, Brazil; Bioinformatics Unit, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Rio de Janeiro, RJ, Brazil.
|
22
|
Panwar B, Menon R, Eksi R, Li HD, Omenn GS, Guan Y. Genome-Wide Functional Annotation of Human Protein-Coding Splice Variants Using Multiple Instance Learning. J Proteome Res 2016; 15:1747-53. [PMID: 27142340 DOI: 10.1021/acs.jproteome.5b00883] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 01/28/2023]
Abstract
The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level; the lack of isoform-level gold standards is an important intellectual limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at isoform level. In the present study, we used a multiple instance learning (MIL)-based approach for predicting the function of PCSVs. We used transcript-level expression values and gene-level functional associations from the Gene Ontology database. A support vector machine (SVM)-based 5-fold cross-validation technique was applied. Comparatively, genes with multiple PCSVs performed better than single PCSV genes, and performance also improved when more examples were available to train the models. We demonstrated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. All predictions have been implemented in a web resource called "IsoFunc", which is freely available for the global scientific community through http://guanlab.ccmb.med.umich.edu/isofunc.
Affiliation(s)
- Bharat Panwar
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Rajasree Menon
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, Department of Internal Medicine, Department of Human Genetics and School of Public Health, and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
|
23
|
Abstract
The laboratory mouse is the primary mammalian species used for studying alternative splicing events. Recent studies have generated computational models to predict functions for splice isoforms in the mouse. However, the functional relationship network, describing the probability of splice isoforms participating in the same biological process or pathway, has not yet been studied in the mouse. Here we describe a rich genome-wide resource of mouse networks at the isoform level, which was generated using a unique framework that was originally developed to infer isoform functions. This network was built through integrating heterogeneous genomic and protein data, including RNA-seq, exon array, protein docking and pseudo-amino acid composition. Through simulation and cross-validation studies, we demonstrated the accuracy of the algorithm in predicting isoform-level functional relationships. We showed that this network enables the users to reveal functional differences of the isoforms of the same gene, as illustrated by literature evidence with Anxa6 (annexin a6) as an example. We expect this work will become a useful resource for the mouse genetics community to understand gene functions. The network is publicly available at: http://guanlab.ccmb.med.umich.edu/isoformnetwork.
|
24
|
Li HD, Omenn GS, Guan Y. A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling. Brief Bioinform 2016; 17:1024-1031. [PMID: 26740460 DOI: 10.1093/bib/bbv109] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/27/2015] [Revised: 11/03/2015] [Indexed: 01/23/2023]
Abstract
The products of multi-exon genes are a mixture of alternatively spliced isoforms, from which the translated proteins can have similar, different or even opposing functions. It is therefore essential to differentiate and annotate functions for individual isoforms. Computational approaches provide an efficient complement to expensive and time-consuming experimental studies. The input data of these methods range from DNA sequence, to RNA selection pressure, to expressed sequence tags, to full-length complementary DNA, to exon array, to RNA-seq expression, to proteomic data. Notably, RNA-seq technology generates quantitative profiling of transcript expression at the genome scale, with an unprecedented amount of expression data available for developing isoform function prediction methods. Integrative analysis of these data at different molecular levels enables a proteogenomic approach to systematically interrogate isoform functions. Here, we briefly review the state-of-the-art methods according to their input data sources, discuss their advantages and limitations and point out potential ways to improve prediction accuracies.
|