1
|
Han K, Liu X, Sun G, Wang Z, Shi C, Liu W, Huang M, Liu S, Guo Q. Enhancing subcellular protein localization mapping analysis using Sc2promap utilizing attention mechanisms. Biochim Biophys Acta Gen Subj 2024:130601. [PMID: 38522679 DOI: 10.1016/j.bbagen.2024.130601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 02/17/2024] [Accepted: 03/15/2024] [Indexed: 03/26/2024]
Abstract
BACKGROUND Aberrant protein localization is a prominent feature in many human diseases and can have detrimental effects on the function of specific tissues and organs. High-throughput technologies, which continue to advance with iterations of automated equipment and the development of bioinformatics, enable the acquisition of large-scale data that are more pattern-rich, allowing for the use of a wider range of methods to extract useful patterns and knowledge from them. METHODS The proposed sc2promap (Spatial and Channel for SubCellular Protein Localization Mapping) model, designed to proficiently extract meaningful features from a vast repository of single-channel grayscale protein images for the purposes of protein localization analysis and clustering. Sc2promap incorporates a prediction head component enriched with supplementary protein annotations, along with the integration of a spatial-channel attention mechanism within the encoder to enables the generation of high-resolution protein localization maps that encapsulate the fundamental characteristics of cells, including elemental cellular localizations such as nuclear and non-nuclear domains. RESULTS Qualitative and quantitative comparisons were conducted across internal and external clustering evaluation metrics, as well as various facets of the clustering results. The study also explored different components of the model. The research outcomes conclusively indicate that, in comparison to previous methods, Sc2promap exhibits superior performance. CONCLUSIONS The amalgamation of the attention mechanism and prediction head components has led the model to excel in protein localization clustering and analysis tasks. GENERAL SIGNIFICANCE The model effectively enhances the capability to extract features and knowledge from protein fluorescence images.
Collapse
Affiliation(s)
- Kaitai Han
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Xi Liu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Guocheng Sun
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Zijun Wang
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Chaojing Shi
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Wu Liu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Mengyuan Huang
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Shitou Liu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
| | - Qianjin Guo
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China.
| |
Collapse
|
2
|
Wang RH, Luo T, Zhang HL, Du PF. PLA-GNN: Computational inference of protein subcellular location alterations under drug treatments with deep graph neural networks. Comput Biol Med 2023; 157:106775. [PMID: 36921458 DOI: 10.1016/j.compbiomed.2023.106775] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2023] [Revised: 02/21/2023] [Accepted: 03/09/2023] [Indexed: 03/12/2023]
Abstract
The aberrant protein sorting has been observed in many conditions, including complex diseases, drug treatments, and environmental stresses. It is important to systematically identify protein mis-localization events in a given condition. Experimental methods for finding mis-localized proteins are always costly and time consuming. Predicting protein subcellular localizations has been studied for many years. However, only a handful of existing works considered protein subcellular location alterations. We proposed a computational method for identifying alterations of protein subcellular locations under drug treatments. We took three drugs, including TSA (trichostain A), bortezomib and tacrolimus, as instances for this study. By introducing dynamic protein-protein interaction networks, graph neural network algorithms were applied to aggregate topological information under different conditions. We systematically reported potential protein mis-localization events under drug treatments. As far as we know, this is the first attempt to find protein mis-localization events computationally in drug treatment conditions. Literatures validated that a number of proteins, which are highly related to pharmacological mechanisms of these drugs, may undergo protein localization alterations. We name our method as PLA-GNN (Protein Localization Alteration by Graph Neural Networks). It can be extended to other drugs and other conditions. All datasets and codes of this study has been deposited in a GitHub repository (https://github.com/quinlanW/PLA-GNN).
Collapse
Affiliation(s)
- Ren-Hua Wang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
| | - Tao Luo
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
| | - Han-Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
| |
Collapse
|
3
|
Kumar R, Dhanda SK. Bird Eye View of Protein Subcellular Localization Prediction. Life (Basel) 2020; 10:E347. [PMID: 33327400 PMCID: PMC7764902 DOI: 10.3390/life10120347] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Revised: 12/11/2020] [Accepted: 12/11/2020] [Indexed: 12/12/2022] Open
Abstract
Proteins are made up of long chain of amino acids that perform a variety of functions in different organisms. The activity of the proteins is determined by the nucleotide sequence of their genes and by its 3D structure. In addition, it is essential for proteins to be destined to their specific locations or compartments to perform their structure and functions. The challenge of computational prediction of subcellular localization of proteins is addressed in various in silico methods. In this review, we reviewed the progress in this field and offered a bird eye view consisting of a comprehensive listing of tools, types of input features explored, machine learning approaches employed, and evaluation matrices applied. We hope the review will be useful for the researchers working in the field of protein localization predictions.
Collapse
Affiliation(s)
- Ravindra Kumar
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, NIH, 9609 Medical Center Drive, Rockville, MD 20850, USA
| | - Sandeep Kumar Dhanda
- Department of Oncology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| |
Collapse
|
4
|
Li GP, Du PF, Shen ZA, Liu HY, Luo T. DPPN-SVM: Computational Identification of Mis-Localized Proteins in Cancers by Integrating Differential Gene Expressions With Dynamic Protein-Protein Interaction Networks. Front Genet 2020; 11:600454. [PMID: 33193746 PMCID: PMC7644922 DOI: 10.3389/fgene.2020.600454] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2020] [Accepted: 10/07/2020] [Indexed: 12/29/2022] Open
Abstract
Eukaryotic cells contain numerous components, which are known as subcellular compartments or subcellular organelles. Proteins must be sorted to proper subcellular compartments to carry out their molecular functions. Mis-localized proteins are related to various cancers. Identifying mis-localized proteins is important in understanding the pathology of cancers and in developing therapies. However, experimental methods, which are used to determine protein subcellular locations, are always costly and time-consuming. We tried to identify cancer-related mis-localized proteins in three different cancers using computational approaches. By integrating gene expression profiles and dynamic protein-protein interaction networks, we established DPPN-SVM (Dynamic Protein-Protein Network with Support Vector Machine), a predictive model using the SVM classifier with diffusion kernels. With this predictive model, we identified a number of mis-localized proteins. Since we introduced the dynamic protein-protein network, which has never been considered in existing works, our model is capable of identifying more mis-localized proteins than existing studies. As far as we know, this is the first study to incorporate dynamic protein-protein interaction network in identifying mis-localized proteins in cancers.
Collapse
Affiliation(s)
- Guang-Ping Li
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Zi-Ang Shen
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Hang-Yu Liu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Tao Luo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
Collapse
|
5
|
Bouziane H, Chouarfia A. Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment. J Integr Bioinform 2020; 18:51-79. [PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022] Open
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigations by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for protein function elucidation. However, it remains a challenging task, especially for multiple sites proteins known to shuttle between cell compartments to perform their proper biological functions and proteins which do not have significant homology to proteins of known subcellular locations. Due to their low-cost and reasonable accuracy, machine learning-based methods have gained much attention in this context with the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem, using different protein sequence features pertaining to the subcellular localization, however, the overwhelming majority of them focuses on single localization and cover very limited cellular locations. The prediction was basically established on sorting signals, amino acids compositions, and homology. To improve the prediction quality, focus is actually on knowledge information extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domains annotation which has been recently a widely adopted and essential information for learning systems. To deal with such problem, in the present study, we considered SCL prediction task as a multi-label learning problem and tried to label both single site and multiple sites unannotated bacterial protein sequences by mining proteins homology relationships using both GO terms of protein homologs and PSI-BLAST profiles. The experiments using 5-fold cross-validation tests on the benchmark datasets showed a significant improvement on the results obtained by the proposed consensus multi-label prediction model which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Collapse
Affiliation(s)
- Hafida Bouziane
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
| | - Abdallah Chouarfia
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
| |
Collapse
|
6
|
Garapati HS, Male G, Mishra K. Predicting subcellular localization of proteins using protein-protein interaction data. Genomics 2020; 112:2361-2368. [PMID: 31945465 DOI: 10.1016/j.ygeno.2020.01.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2019] [Revised: 01/01/2020] [Accepted: 01/11/2020] [Indexed: 10/25/2022]
Abstract
The knowledge of subcellular localization of proteins can provide useful clues about their functions. The conventional methods to determine the subcellular localization are unable to keep pace with the rate at which the new data is being generated. Thus, though sequence information is available, the localization and function of a number of proteins remains unknown. In this study, we have developed a script that makes use of the physical interactors of a protein and their localization data to predict the subcellular localization. We used the script to predict the localization of yeast proteins for which there is no localization data. Further, we experimentally verified the predicted localization for six arbitrarily chosen proteins and found our predictions to be correct for five of the proteins.
Collapse
Affiliation(s)
- Hita Sony Garapati
- Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
| | - Gurranna Male
- Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
| | - Krishnaveni Mishra
- Department of Biochemistry, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India.
| |
Collapse
|
7
|
Kim M, Tagkopoulos I. Data integration and predictive modeling methods for multi-omics datasets. Mol Omics 2018; 14:8-25. [DOI: 10.1039/c7mo00051k] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
We provide an overview of opportunities and challenges in multi-omics predictive analytics with particular emphasis on data integration and machine learning methods.
Collapse
Affiliation(s)
- Minseung Kim
- Department of Computer Science
- University of California
- Davis
- USA
- Genome Center
| | - Ilias Tagkopoulos
- Department of Computer Science
- University of California
- Davis
- USA
- Genome Center
| |
Collapse
|
8
|
Zhu L, Deng SP, You ZH, Huang DS. Identifying Spurious Interactions in the Protein-Protein Interaction Networks Using Local Similarity Preserving Embedding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:345-352. [PMID: 28368812 DOI: 10.1109/tcbb.2015.2407393] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In recent years, a remarkable amount of protein-protein interaction (PPI) data are being available owing to the advance made in experimental high-throughput technologies. However, the experimentally detected PPI data usually contain a large amount of spurious links, which could contaminate the analysis of the biological significance of protein links and lead to incorrect biological discoveries, thereby posing new challenges to both computational and biological scientists. In this paper, we develop a new embedding algorithm called local similarity preserving embedding (LSPE) to rank the interaction possibility of protein links. By going beyond limitations of current geometric embedding methods for network denoising and emphasizing the local information of PPI networks, LSPE can avoid the unstableness of previous methods. We demonstrate experimental results on benchmark PPI networks and show that LSPE was the overall leader, outperforming the state-of-the-art methods in topological false links elimination problems.
Collapse
|
9
|
Hasan MAM, Ahmad S, Molla MKI. Protein subcellular localization prediction using multiple kernel learning based support vector machine. MOLECULAR BIOSYSTEMS 2017; 13:785-795. [DOI: 10.1039/c6mb00860g] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
An efficient multi-label protein subcellular localization prediction system was developed by introducing multiple kernel learning (MKL) based support vector machine (SVM).
Collapse
Affiliation(s)
- Md. Al Mehedi Hasan
- Department of Computer Science & Engineering
- University of Rajshahi
- Rajshahi
- Bangladesh
| | - Shamim Ahmad
- Department of Computer Science & Engineering
- University of Rajshahi
- Rajshahi
- Bangladesh
| | | |
Collapse
|
10
|
Sharma R, Dehzangi A, Lyons J, Paliwal K, Tsunoda T, Sharma A. Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC. IEEE Trans Nanobioscience 2015; 14:915-26. [DOI: 10.1109/tnb.2015.2500186] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
11
|
Liu Z, Hu J. Mislocalization-related disease gene discovery using gene expression based computational protein localization prediction. Methods 2015; 93:119-27. [PMID: 26416496 DOI: 10.1016/j.ymeth.2015.09.022] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2015] [Revised: 09/17/2015] [Accepted: 09/21/2015] [Indexed: 01/09/2023] Open
Abstract
Protein sorting is an important mechanism for transporting proteins to their target subcellular locations after their synthesis. Mutations on genes may disrupt the well regulated protein sorting process, leading to a variety of mislocation related diseases. This paper proposes a methodology to discover such disease genes based on gene expression data and computational protein localization prediction. A kernel logistic regression based algorithm is used to successfully identify several candidate cancer genes which may cause cancers due to their mislocation within the cell. Our results also showed that compared to the gene co-expression network defined on Pearson correlation coefficients, the nonlinear Maximum Correlation Coefficients (MIC) based co-expression network give better results for subcellular localization prediction.
Collapse
Affiliation(s)
- Zhonghao Liu
- Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States
| | - Jianjun Hu
- Department of Computer Science & Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, United States.
| |
Collapse
|
12
|
Zheng W, Blake C. Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles. J Biomed Inform 2015. [PMID: 26220461 DOI: 10.1016/j.jbi.2015.07.013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Databases of curated biomedical knowledge, such as the protein-locations reflected in the UniProtKB database, provide an accurate and useful resource to researchers and decision makers. Our goal is to augment the manual efforts currently used to curate knowledge bases with automated approaches that leverage the increased availability of full-text scientific articles. This paper describes experiments that use distant supervised learning to identify protein subcellular localizations, which are important to understand protein function and to identify candidate drug targets. Experiments consider Swiss-Prot, the manually annotated subset of the UniProtKB protein knowledge base, and 43,000 full-text articles from the Journal of Biological Chemistry that contain just under 11.5 million sentences. The system achieves 0.81 precision and 0.49 recall at sentence level and an accuracy of 57% on held-out instances in a test set. Moreover, the approach identifies 8210 instances that are not in the UniProtKB knowledge base. Manual inspection of the 50 most likely relations showed that 41 (82%) were valid. These results have immediate benefit to researchers interested in protein function, and suggest that distant supervision should be explored to complement other manual data curation efforts.
Collapse
Affiliation(s)
- Wu Zheng
- Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, USA.
| | - Catherine Blake
- Graduate School of Library and Information Science and Medical Information Science, Center for Informatics Research in Science and Scholarship (CIRSS), University of Illinois at Urbana-Champaign, USA.
| |
Collapse
|
13
|
Abstract
Protein location and function can change dynamically depending on many factors, including environmental stress, disease state, age, developmental stage, and cell type. Here, we describe an integrative computational framework, called the conditional function predictor (CoFP; http://nbm.ajou.ac.kr/cofp/), for predicting changes in subcellular location and function on a proteome-wide scale. The essence of the CoFP approach is to cross-reference general knowledge about a protein and its known network of physical interactions, which typically pool measurements from diverse environments, against gene expression profiles that have been measured under specific conditions of interest. Using CoFP, we predict condition-specific subcellular locations, biological processes, and molecular functions of the yeast proteome under 18 specified conditions. In addition to highly accurate retrieval of previously known gold standard protein locations and functions, CoFP predicts previously unidentified condition-dependent locations and functions for nearly all yeast proteins. Many of these predictions can be confirmed using high-resolution cellular imaging. We show that, under DNA-damaging conditions, Tsr1, Caf120, Dip5, Skg6, Lte1, and Nnf2 change subcellular location and RNA polymerase I subunit A43, Ino2, and Ids2 show changes in DNA binding. Beyond specific predictions, this work reveals a global landscape of changing protein location and function, highlighting a surprising number of proteins that translocate from the mitochondria to the nucleus or from endoplasmic reticulum to Golgi apparatus under stress.
Collapse
|
14
|
Yu CS, Cheng CW, Su WC, Chang KC, Huang SW, Hwang JK, Lu CH. CELLO2GO: a web server for protein subCELlular LOcalization prediction with functional gene ontology annotation. PLoS One 2014; 9:e99368. [PMID: 24911789 PMCID: PMC4049835 DOI: 10.1371/journal.pone.0099368] [Citation(s) in RCA: 270] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2013] [Accepted: 05/14/2014] [Indexed: 01/15/2023] Open
Abstract
CELLO2GO (http://cello.life.nctu.edu.tw/cello2go/) is a publicly available, web-based system for screening various properties of a targeted protein and its subcellular localization. Herein, we describe how this platform is used to obtain a brief or detailed gene ontology (GO)-type categories, including subcellular localization(s), for the queried proteins by combining the CELLO localization-predicting and BLAST homology-searching approaches. Given a query protein sequence, CELLO2GO uses BLAST to search for homologous sequences that are GO annotated in an in-house database derived from the UniProt KnowledgeBase database. At the same time, CELLO attempts predict at least one subcellular localization on the basis of the species in which the protein is found. When homologs for the query sequence have been identified, the number of terms found for each of their GO categories, i.e., cellular compartment, molecular function, and biological process, are summed and presented as pie charts representing possible functional annotations for the queried protein. Although the experimental subcellular localization of a protein may not be known, and thus not annotated, CELLO can confidentially suggest a subcellular localization. CELLO2GO should be a useful tool for research involving complex subcellular systems because it combines CELLO and BLAST into one platform and its output is easily manipulated such that the user-specific questions may be readily addressed.
Collapse
Affiliation(s)
- Chin-Sheng Yu
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
- Master's Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, Taichung, Taiwan
| | - Chih-Wen Cheng
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Wen-Chi Su
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Kuei-Chung Chang
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Shao-Wei Huang
- Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
| | - Jenn-Kang Hwang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Center of Bioinformatics Research, National Chiao Tung University, Hsinchu, Taiwan
| | - Chih-Hao Lu
- Graduate Institute of Basic Medical Science, China Medical University, Taichung, Taiwan
- * E-mail:
| |
Collapse
|
15
|
Du P, Gu S, Jiao Y. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int J Mol Sci 2014; 15:3495-506. [PMID: 24577312 PMCID: PMC3975349 DOI: 10.3390/ijms15033495] [Citation(s) in RCA: 242] [Impact Index Per Article: 24.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2014] [Revised: 02/13/2014] [Accepted: 02/14/2014] [Indexed: 11/16/2022] Open
Abstract
The general form pseudo-amino acid composition (PseAAC) has been widely used to represent protein sequences in predicting protein structural and functional attributes. We developed the program PseAAC-General to generate various different modes of Chou’s general PseAAC, such as the gene ontology mode, the functional domain mode, and the sequential evolution mode. This program allows the users to define their own desired modes. In every mode, 544 physicochemical properties of the amino acids are available for choosing. The computing efficiency is at least 100 times that of existing programs, which makes it able to facilitate the extensive studies on proteins and peptides. The PseAAC-General is freely available via SourceForge. It runs on both Linux and Windows.
Collapse
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| | - Shuwang Gu
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| | - Yasen Jiao
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| |
Collapse
|
16
|
Stekhoven DJ, Omasits U, Quebatte M, Dehio C, Ahrens CH. Proteome-wide identification of predominant subcellular protein localizations in a bacterial model organism. J Proteomics 2014; 99:123-37. [PMID: 24486812 DOI: 10.1016/j.jprot.2014.01.015] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2013] [Revised: 01/12/2014] [Accepted: 01/15/2014] [Indexed: 01/04/2023]
Abstract
UNLABELLED Proteomics data provide unique insights into biological systems, including the predominant subcellular localization (SCL) of proteins, which can reveal important clues about their functions. Here we analyzed data of a complete prokaryotic proteome expressed under two conditions mimicking interaction of the emerging pathogen Bartonella henselae with its mammalian host. Normalized spectral count data from cytoplasmic, total membrane, inner and outer membrane fractions allowed us to identify the predominant SCL for 82% of the identified proteins. The spectral count proportion of total membrane versus cytoplasmic fractions indicated the propensity of cytoplasmic proteins to co-fractionate with the inner membrane, and enabled us to distinguish cytoplasmic, peripheral inner membrane and bona fide inner membrane proteins. Principal component analysis and k-nearest neighbor classification training on selected marker proteins or predominantly localized proteins, allowed us to determine an extensive catalog of at least 74 expressed outer membrane proteins, and to extend the SCL assignment to 94% of the identified proteins, including 18% where in silico methods gave no prediction. Suitable experimental proteomics data combined with straightforward computational approaches can thus identify the predominant SCL on a proteome-wide scale. Finally, we present a conceptual approach to identify proteins potentially changing their SCL in a condition-dependent fashion. BIOLOGICAL SIGNIFICANCE The work presented here describes the first prokaryotic proteome-wide subcellular localization (SCL) dataset for the emerging pathogen B. henselae (Bhen). The study indicates that suitable subcellular fractionation experiments combined with straight-forward computational analysis approaches assessing the proportion of spectral counts observed in different subcellular fractions are powerful for determining the predominant SCL of a large percentage of the experimentally observed proteins. This includes numerous cases where in silico prediction methods do not provide any prediction. Avoiding a treatment with harsh conditions, cytoplasmic proteins tend to co-fractionate with proteins of the inner membrane fraction, indicative of close functional interactions. The spectral count proportion (SCP) of total membrane versus cytoplasmic fractions allowed us to obtain a good indication about the relative proximity of individual protein complex members to the inner membrane. Using principal component analysis and k-nearest neighbor approaches, we were able to extend the percentage of proteins with a predominant experimental localization to over 90% of all expressed proteins and identified a set of at least 74 outer membrane (OM) proteins. In general, OM proteins represent a rich source of candidates for the development of urgently needed new therapeutics in combat of resurgence of infectious disease and multi-drug resistant bacteria. Finally, by comparing the data from two infection biology relevant conditions, we conceptually explore methods to identify and visualize potential candidates that may partially change their SCL in these different conditions. The data are made available to researchers as a SCL compendium for Bhen and as an assistance in further improving in silico SCL prediction algorithms.
Collapse
Affiliation(s)
- Daniel J Stekhoven
- Quantitative Model Organism Proteomics (Q-MOP), Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.
| | - Ulrich Omasits
- Quantitative Model Organism Proteomics (Q-MOP), Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland; Institute of Molecular Systems Biology, ETH Zurich, Auguste-Piccard-Hof 1, 8093 Zurich, Switzerland
| | - Maxime Quebatte
- Biozentrum, University of Basel, Klingelbergstrasse 50/70, 4056 Basel, Switzerland
| | - Christoph Dehio
- Biozentrum, University of Basel, Klingelbergstrasse 50/70, 4056 Basel, Switzerland
| | - Christian H Ahrens
- Quantitative Model Organism Proteomics (Q-MOP), Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.
| |
Collapse
|
17
|
Predicting human protein subcellular locations by the ensemble of multiple predictors via protein-protein interaction network with edge clustering coefficients. PLoS One 2014; 9:e86879. [PMID: 24466278 PMCID: PMC3900678 DOI: 10.1371/journal.pone.0086879] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2013] [Accepted: 12/18/2013] [Indexed: 12/14/2022] Open
Abstract
One of the fundamental tasks in biology is to identify the functions of all proteins to reveal the primary machinery of a cell. Knowledge of the subcellular locations of proteins will provide key hints to reveal their functions and to understand the intricate pathways that regulate biological processes at the cellular level. Protein subcellular location prediction has been extensively studied in the past two decades. A lot of methods have been developed based on protein primary sequences as well as protein-protein interaction network. In this paper, we propose to use the protein-protein interaction network as an infrastructure to integrate existing sequence based predictors. When predicting the subcellular locations of a given protein, not only the protein itself, but also all its interacting partners were considered. Unlike existing methods, our method requires neither the comprehensive knowledge of the protein-protein interaction network nor the experimentally annotated subcellular locations of most proteins in the protein-protein interaction network. Besides, our method can be used as a framework to integrate multiple predictors. Our method achieved 56% on human proteome in absolute-true rate, which is higher than the state-of-the-art methods.
Collapse
|
18
|
Du P, Xu C. Predicting multisite protein subcellular locations: progress and challenges. Expert Rev Proteomics 2014; 10:227-37. [DOI: 10.1586/epr.13.16] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
19
|
Lei C, Tamim S, Bishop AJ, Ruan J. Fully automated protein complex prediction based on topological similarity and community structure. Proteome Sci 2013; 11:S9. [PMID: 24564887 PMCID: PMC3908383 DOI: 10.1186/1477-5956-11-s1-s9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
To understand the function of protein complexes and their association with biological processes, a lot of studies have been done towards analyzing the protein-protein interaction (PPI) networks. However, the advancement in high-throughput technology has resulted in a humongous amount of data for analysis. Moreover, high level of noise, sparseness, and skewness in degree distribution of PPI networks limits the performance of many clustering algorithms and further analysis of their interactions. In addressing and solving these problems we present a novel random walk based algorithm that converts the incomplete and binary PPI network into a protein-protein topological similarity matrix (PP-TS matrix). We believe that if two proteins share some high-order topological similarities they are likely to be interacting with each other. Using the obtained PP-TS matrix, we constructed and used weighted networks to further study and analyze the interaction among proteins. Specifically, we applied a fully automated community structure finding algorithm (Auto-HQcut) on the obtained weighted network to cluster protein complexes. We then analyzed the protein complexes for significance in biological processes. To help visualize and analyze these protein complexes we also developed an interface that displays the resulting complexes as well as the characteristics associated with each complex. Applying our approach to a yeast protein-protein interaction network, we found that the predicted protein-protein interaction pairs with high topological similarities have more significant biological relevance than the original protein-protein interactions pairs. When we compared our PPI network reconstruction algorithm with other existing algorithms using gene ontology and gene co-expression, our algorithm produced the highest similarity scores. Also, our predicted protein complexes showed higher accuracy measure compared to the other protein complex predictions.
Collapse
|
20
|
Mei S. SVM ensemble based transfer learning for large-scale membrane proteins discrimination. J Theor Biol 2013; 340:105-10. [PMID: 24050851 DOI: 10.1016/j.jtbi.2013.09.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2013] [Revised: 09/04/2013] [Accepted: 09/06/2013] [Indexed: 11/16/2022]
Abstract
Membrane proteins play important roles in molecular trans-membrane transport, ligand-receptor recognition, cell-cell interaction, enzyme catalysis, host immune defense response and infectious disease pathways. Up to present, discriminating membrane proteins remains a challenging problem from the viewpoints of biological experimental determination and computational modeling. This work presents SVM ensemble based transfer learning model for membrane proteins discrimination (SVM-TLM). To reduce the data constraints on computational modeling, this method investigates the effectiveness of transferring the homolog knowledge to the target membrane proteins under the framework of probability weighted ensemble learning. As compared to multiple kernel learning based transfer learning model, the method takes the advantages of sparseness based SVM optimization on large data, thus more computationally efficient for large protein data analysis. The experiments on large membrane protein benchmark dataset show that SVM-TLM achieves significantly better cross validation performance than the baseline model.
Collapse
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, China.
| |
Collapse
|
21
|
Lee K, Byun K, Hong W, Chuang HY, Pack CG, Bayarsaikhan E, Paek SH, Kim H, Shin HY, Ideker T, Lee B. Proteome-wide discovery of mislocated proteins in cancer. Genome Res 2013; 23:1283-94. [PMID: 23674306 PMCID: PMC3730102 DOI: 10.1101/gr.155499.113] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Several studies have sought systematically to identify protein subcellular locations, but an even larger task is to map which of these proteins conditionally relocates in disease (the mislocalizome). Here, we report an integrative computational framework for mapping conditional location and mislocation of proteins on a proteome-wide scale, called a conditional location predictor (CoLP). Using CoLP, we mapped the locations of over 10,000 proteins in normal human brain and in glioma. The prediction showed 0.9 accuracy using 100 location tests of 20 randomly selected proteins. Of the 10,000 proteins, over 150 have a strong likelihood of mislocation under glioma, which is striking considering that few mislocation events have been identified in this disease previously. Using immunofluorescence and Western blotting in both primary cells and tissues, we successfully experimentally confirmed 15 mislocations. The most common type of mislocation occurs between the endoplasmic reticulum and the nucleus; for example, for RNF138, TLX3, and NFRKB. In particular, we found that the gene for the mislocating protein GFRA4 had a nonsynonymous point mutation in exon 2. Moreover, redirection of GFRA4 to its normal location, the plasma membrane, led to marked reductions in phospho-STAT3 and proliferation of glioma cells. This framework has the potential to track changes in protein location in many human diseases.
Collapse
Affiliation(s)
- KiYoung Lee
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea.
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
22
|
Integrative analysis of congenital muscular torticollis: from gene expression to clinical significance. BMC Med Genomics 2013; 6 Suppl 2:S10. [PMID: 23819832 PMCID: PMC3654872 DOI: 10.1186/1755-8794-6-s2-s10] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Background Congenital muscular torticollis (CMT) is characterized by thickening and/or tightness of the unilateral sternocleidomastoid muscle (SCM), ending up with torticollis. Our aim was to identify differentially expressed genes (DEGs) and novel protein interaction network modules of CMT, and to discover the relationship between gene expressions and clinical severity of CMT. Results Twenty-eight sternocleidomastoid muscles (SCMs) from 23 subjects with CMT and 5 SCMs without CMT were allocated for microarray, MRI, or imunohistochemical studies. We first identified 269 genes as the DEGs in CMT. Gene ontology enrichment analysis revealed that the main function of the DEGs is for extracellular region part during developmental processes. Five CMT-related protein network modules were identified, which showed that the important pathway is fibrosis related with collagen and elastin fibrillogenesis with an evidence of DNA repair mechanism. Interestingly, the expression levels of the 8 DEGs called CMT signature genes whose mRNA expression was double-confirmed by quantitative real time PCR showed good correlation with the severity of CMT which was measured with the pre-operational MRI images (R2 ranging from 0.82 to 0.21). Moreover, the protein expressions of ELN, ASPN and CHD3 which were identified from the CMT-related protein network modules demonstrated the differential expression between the CMT and normal SCM. Conclusions We here provided an integrative analysis of CMT from gene expression to clinical significance, which showed good correlation with clinical severity of CMT. Furthermore, the CMT-related protein network modules were identified, which provided more in-depth understanding of pathophysiology of CMT.
Collapse
|
23
|
Lei C, Ruan J. A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. ACTA ACUST UNITED AC 2012; 29:355-64. [PMID: 23235927 DOI: 10.1093/bioinformatics/bts688] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
MOTIVATION Recent advances in technology have dramatically increased the availability of protein-protein interaction (PPI) data and stimulated the development of many methods for improving the systems level understanding the cell. However, those efforts have been significantly hindered by the high level of noise, sparseness and highly skewed degree distribution of PPI networks. Here, we present a novel algorithm to reduce the noise present in PPI networks. The key idea of our algorithm is that two proteins sharing some higher-order topological similarities, measured by a novel random walk-based procedure, are likely interacting with each other and may belong to the same protein complex. RESULTS Applying our algorithm to a yeast PPI network, we found that the edges in the reconstructed network have higher biological relevance than in the original network, assessed by multiple types of information, including gene ontology, gene expression, essentiality, conservation between species and known protein complexes. Comparison with existing methods shows that the network reconstructed by our method has the highest quality. Using two independent graph clustering algorithms, we found that the reconstructed network has resulted in significantly improved prediction accuracy of protein complexes. Furthermore, our method is applicable to PPI networks obtained with different experimental systems, such as affinity purification, yeast two-hybrid (Y2H) and protein-fragment complementation assay (PCA), and evidence shows that the predicted edges are likely bona fide physical interactions. Finally, an application to a human PPI network increased the coverage of the network by at least 100%. AVAILABILITY www.cs.utsa.edu/∼jruan/RWS/.
Collapse
Affiliation(s)
- Chengwei Lei
- Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249, USA
| | | |
Collapse
|
24
|
Subcellular localization prediction for human internal and organelle membrane proteins with projected gene ontology scores. J Theor Biol 2012; 313:61-7. [DOI: 10.1016/j.jtbi.2012.08.016] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2012] [Revised: 07/05/2012] [Accepted: 08/15/2012] [Indexed: 11/15/2022]
|
25
|
Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. J Theor Biol 2012; 310:80-7. [DOI: 10.1016/j.jtbi.2012.06.028] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2012] [Revised: 05/12/2012] [Accepted: 06/18/2012] [Indexed: 11/21/2022]
|
26
|
Embryonic Stem Cell Interactomics: The Beginning of a Long Road to Biological Function. Stem Cell Rev Rep 2012; 8:1138-54. [DOI: 10.1007/s12015-012-9400-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
|
27
|
Lin JR, Mondal AM, Liu R, Hu J. Minimalist ensemble algorithms for genome-wide protein localization prediction. BMC Bioinformatics 2012; 13:157. [PMID: 22759391 PMCID: PMC3426488 DOI: 10.1186/1471-2105-13-157] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2011] [Accepted: 07/03/2012] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. RESULTS This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. CONCLUSIONS We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi.
Collapse
Affiliation(s)
- Jhih-Rong Lin
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
| | | | | | | |
Collapse
|
28
|
Jiang JQ, Wu M. Predicting multiplex subcellular localization of proteins using protein-protein interaction network: a comparative study. BMC Bioinformatics 2012; 13 Suppl 10:S20. [PMID: 22759426 PMCID: PMC3314587 DOI: 10.1186/1471-2105-13-s10-s20] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Proteins that interact in vivo tend to reside within the same or "adjacent" subcellular compartments. This observation provides opportunities to reveal protein subcellular localization in the context of the protein-protein interaction (PPI) network. However, so far, only a few efforts based on heuristic rules have been made in this regard. RESULTS We systematically and quantitatively validate the hypothesis that proteins physically interacting with each other probably share at least one common subcellular localization. With the result, for the first time, four graph-based semi-supervised learning algorithms, Majority, χ2-score, GenMultiCut and FunFlow originally proposed for protein function prediction, are introduced to assign "multiplex localization" to proteins. We analyze these approaches by performing a large-scale cross validation on a Saccharomyces cerevisiae proteome compiled from BioGRID and comparing their predictions for 22 protein subcellular localizations. Furthermore, we build an ensemble classifier to associate 529 unlabeled and 137 ambiguously-annotated proteins with subcellular localizations, most of which have been verified in the previous experimental studies. CONCLUSIONS Physical interaction of proteins has actually provided an essential clue for their co-localization. Compared to the local approaches, the global algorithms consistently achieve a superior performance.
Collapse
Affiliation(s)
- Jonathan Q Jiang
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, PR China
| | | |
Collapse
|
29
|
Proteome-wide prediction of protein-protein interactions from high-throughput data. Protein Cell 2012; 3:508-20. [PMID: 22729399 DOI: 10.1007/s13238-012-2945-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2012] [Accepted: 05/30/2012] [Indexed: 12/15/2022] Open
Abstract
In this paper, we present a brief review of the existing computational methods for predicting proteome-wide protein-protein interaction networks from high-throughput data. The availability of various types of omics data provides great opportunity and also unprecedented challenge to infer the interactome in cells. Reconstructing the interactome or interaction network is a crucial step for studying the functional relationship among proteins and the involved biological processes. The protein interaction network will provide valuable resources and alternatives to decipher the mechanisms of these functionally interacting elements as well as the running system of cellular operations. In this paper, we describe the main steps of predicting protein-protein interaction networks and categorize the available approaches to couple the physical and functional linkages. The future topics and the analyses beyond prediction are also discussed and concluded.
Collapse
|
30
|
Mei S. Multi-label multi-kernel transfer learning for human protein subcellular localization. PLoS One 2012; 7:e37716. [PMID: 22719847 PMCID: PMC3374840 DOI: 10.1371/journal.pone.0037716] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2011] [Accepted: 04/28/2012] [Indexed: 11/19/2022] Open
Abstract
Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, the existing sequence-based predictive models demonstrate moderate or unsatisfactory performance, and the gene ontology (GO) based models may take the risk of performance overestimation for novel proteins. Furthermore, many human proteins have multiple subcellular locations, which renders the computational modelling more complicated. Up to the present, there are far few researches specialized for predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM proposes a multi-label confusion matrix, formally formulates three multi-labelling performance measures and adapts one-against-all multi-class probabilistic outputs to multi-label learning scenario, based on which to further extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) for multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, comprehensive survey of model performance for novel protein and multi-labelling capability, MLMK-TLM will gain more practical applicability. The experiments on human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar.
Collapse
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, China.
| |
Collapse
|
31
|
Ben-Hamo R, Efroni S. Biomarker robustness reveals the PDGF network as driving disease outcome in ovarian cancer patients in multiple studies. BMC SYSTEMS BIOLOGY 2012; 6:3. [PMID: 22236809 PMCID: PMC3298526 DOI: 10.1186/1752-0509-6-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/13/2011] [Accepted: 01/11/2012] [Indexed: 12/27/2022]
Abstract
Background Ovarian cancer causes more deaths than any other gynecological cancer. Identifying the molecular mechanisms that drive disease progress in ovarian cancer is a critical step in providing therapeutics, improving diagnostics, and affiliating clinical behavior with disease etiology. Identification of molecular interactions that stratify prognosis is key in facilitating a clinical-molecular perspective. Results The Cancer Genome Atlas has recently made available the molecular characteristics of more than 500 patients. We used the TCGA multi-analysis study, and two additional datasets and a set of computational algorithms that we developed. The computational algorithms are based on methods that identify network alterations and quantify network behavior through gene expression. We identify a network biomarker that significantly stratifies survival rates in ovarian cancer patients. Interestingly, expression levels of single or sets of genes do not explain the prognostic stratification. The discovered biomarker is composed of the network around the PDGF pathway. The biomarker enables prognosis stratification. Conclusion The work presented here demonstrates, through the power of gene-expression networks, the criticality of the PDGF network in driving disease course. In uncovering the specific interactions within the network, that drive the phenotype, we catalyze targeted treatment, facilitate prognosis and offer a novel perspective into hidden disease heterogeneity.
Collapse
|
32
|
Mei S. Multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization. J Theor Biol 2012; 293:121-30. [DOI: 10.1016/j.jtbi.2011.10.015] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2011] [Revised: 10/09/2011] [Accepted: 10/13/2011] [Indexed: 10/16/2022]
|
33
|
Yoon D, Kim H, Suh-Kim H, Park RW, Lee K. Differentially co-expressed interacting protein pairs discriminate samples under distinct stages of HIV type 1 infection. BMC SYSTEMS BIOLOGY 2011; 5 Suppl 2:S1. [PMID: 22784566 PMCID: PMC3287475 DOI: 10.1186/1752-0509-5-s2-s1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Background Microarray analyses based on differentially expressed genes (DEGs) have been widely used to distinguish samples across different cellular conditions. However, studies based on DEGs have not been able to clearly determine significant differences between samples of pathophysiologically similar HIV-1 stages, e.g., between acute and chronic progressive (or AIDS) or between uninfected and clinically latent stages. We here suggest a novel approach to allow such discrimination based on stage-specific genetic features of HIV-1 infection. Our approach is based on co-expression changes of genes known to interact. The method can identify a genetic signature for a single sample as contrasted with existing protein-protein-based analyses with correlational designs. Methods Our approach distinguishes each sample using differentially co-expressed interacting protein pairs (DEPs) based on co-expression scores of individual interacting pairs within a sample. The co-expression score has positive value if two genes in a sample are simultaneously up-regulated or down-regulated. And the score has higher absolute value if expression-changing ratios are similar between the two genes. We compared characteristics of DEPs with that of DEGs by evaluating their usefulness in separation of HIV-1 stage. And we identified DEP-based network-modules and their gene-ontology enrichment to find out the HIV-1 stage-specific gene signature. Results Based on the DEP approach, we observed clear separation among samples from distinct HIV-1 stages using clustering and principal component analyses. Moreover, the discrimination power of DEPs on the samples (70–100% accuracy) was much higher than that of DEGs (35–45%) using several well-known classifiers. DEP-based network analysis also revealed the HIV-1 stage-specific network modules; the main biological processes were related to “translation,” “RNA splicing,” “mRNA, RNA, and nucleic acid transport,” and “DNA metabolism.” Through the HIV-1 stage-related modules, changing stage-specific patterns of protein interactions could be observed. Conclusions DEP-based method discriminated the HIV-1 infection stages clearly, and revealed a HIV-1 stage-specific gene signature. The proposed DEP-based method might complement existing DEG-based approaches in various microarray expression analyses.
Collapse
Affiliation(s)
- Dukyong Yoon
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea
| | | | | | | | | |
Collapse
|
34
|
Fan GL, Li QZ. Predicting protein submitochondria locations by combining different descriptors into the general form of Chou's pseudo amino acid composition. Amino Acids 2011; 43:545-55. [PMID: 22102053 DOI: 10.1007/s00726-011-1143-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Accepted: 10/27/2011] [Indexed: 10/15/2022]
Abstract
Knowledge of the submitochondria location of protein is integral to understanding its function and a necessity in the proteomics era. In this work, a new submitochondria data set is constructed, and an approach for predicting protein submitochondria locations is proposed by combining the amino acid composition, dipeptide composition, reduced physicochemical properties, gene ontology, evolutionary information, and pseudo-average chemical shift. The overall prediction accuracy is 93.57% for the submitochondria location and 97.79% for the three membrane protein types in the mitochondria inner membrane using the algorithm of the increment of diversity combined with the support vector machine. The performance of the pseudo-average chemical shift is excellent. For contrast, the method is also used to predict submitochondria locations in the data set constructed by Du and Li; an accuracy of 94.95% is obtained by our method, which is better than that of other existing methods.
Collapse
Affiliation(s)
- Guo-Liang Fan
- Department of Physics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | | |
Collapse
|
35
|
Lin TH, Bar-Joseph Z, Murphy RF. Learning cellular sorting pathways using protein interactions and sequence motifs. J Comput Biol 2011; 18:1709-22. [PMID: 21999284 DOI: 10.1089/cmb.2011.0193] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Proper subcellular localization is critical for proteins to perform their roles in cellular functions. Proteins are transported by different cellular sorting pathways, some of which take a protein through several intermediate locations until reaching its final destination. The pathway a protein is transported through is determined by carrier proteins that bind to specific sequence motifs. In this article, we present a new method that integrates protein interaction and sequence motif data to model how proteins are sorted through these sorting pathways. We use a hidden Markov model (HMM) to represent protein sorting pathways. The model is able to determine intermediate sorting states and to assign carrier proteins and motifs to the sorting pathways. In simulation studies, we show that the method can accurately recover an underlying sorting model. Using data for yeast, we show that our model leads to accurate prediction of subcellular localization. We also show that the pathways learned by our model recover many known sorting pathways and correctly assign proteins to the path they utilize. The learned model identified new pathways and their putative carriers and motifs and these may represent novel protein sorting mechanisms. Supplementary results and software implementation are available from http://murphylab.web.cmu.edu/software/2010_RECOMB_pathways/.
Collapse
Affiliation(s)
- Tien-Ho Lin
- Language Technology Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | | | | |
Collapse
|
36
|
Mei S, Fei W, Zhou S. Gene ontology based transfer learning for protein subcellular localization. BMC Bioinformatics 2011; 12:44. [PMID: 21284890 PMCID: PMC3039576 DOI: 10.1186/1471-2105-12-44] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2010] [Accepted: 02/02/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, both of which can not estimate the individual discriminative abilities of the three aspects of gene ontology. RESULTS In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse affect of the potential false GO terms that are resulted from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time & space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models: MultiLoc, MultiLoc-GO and Euk-mPLoc on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvement against the baseline models: 80.38% against model Euk-mPLoc 67.40% with 12.98% substantial increase; 96.65% and 96.27% against model MultiLoc-GO 89.60% and 89.60%, with 7.05% and 6.67% accuracy increase on dataset MultiLoc plant and dataset MultiLoc animal, respectively; 97.14%, 95.90% and 96.85% against model MultiLoc-GO 83.70%, 90.10% and 85.70%, with accuracy increase 13.44%, 5.8% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal respectively. For BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc plant holdout and dataset BaCelLoc animal holdout, respectively, as compared against baseline model MultiLoc-GO 76%, 60.00% and 73.00%, with accuracy increase 5.25%, 20.45% and 6.46%, respectively. CONCLUSIONS Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
Collapse
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, PR China.
| | | | | |
Collapse
|
37
|
Scott MS, Boisvert FM, Lamond AI, Barton GJ. PNAC: a protein nucleolar association classifier. BMC Genomics 2011; 12:74. [PMID: 21272300 PMCID: PMC3038921 DOI: 10.1186/1471-2164-12-74] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2010] [Accepted: 01/27/2011] [Indexed: 01/11/2023] Open
Abstract
Background Although primarily known as the site of ribosome subunit production, the nucleolus is involved in numerous and diverse cellular processes. Recent large-scale proteomics projects have identified thousands of human proteins that associate with the nucleolus. However, in most cases, we know neither the fraction of each protein pool that is nucleolus-associated nor whether their association is permanent or conditional. Results To describe the dynamic localisation of proteins in the nucleolus, we investigated the extent of nucleolar association of proteins by first collating an extensively curated literature-derived dataset. This dataset then served to train a probabilistic predictor which integrates gene and protein characteristics. Unlike most previous experimental and computational studies of the nucleolar proteome that produce large static lists of nucleolar proteins regardless of their extent of nucleolar association, our predictor models the fluidity of the nucleolus by considering different classes of nucleolar-associated proteins. The new method predicts all human proteins as either nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic or non-nucleolar. Leave-one-out cross validation tests reveal sensitivity values for these four classes ranging from 0.72 to 0.90 and positive predictive values ranging from 0.63 to 0.94. The overall accuracy of the classifier was measured to be 0.85 on an independent literature-based test set and 0.74 using a large independent quantitative proteomics dataset. While the three nucleolar-association groups display vastly different Gene Ontology biological process signatures and evolutionary characteristics, they collectively represent the most well characterised nucleolar functions. Conclusions Our proteome-wide classification of nucleolar association provides a novel representation of the dynamic content of the nucleolus. This model of nucleolar localisation thus increases the coverage while providing accurate and specific annotations of the nucleolar proteome. It will be instrumental in better understanding the central role of the nucleolus in the cell and its interaction with other subcellular compartments.
Collapse
Affiliation(s)
- Michelle S Scott
- Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH, UK.
| | | | | | | |
Collapse
|
38
|
Abstract
Since the proposal of the signal hypothesis on protein subcellular sorting, a number of computational analyses have been performed in this field. A typical example is the development of prediction algorithms for the subcellular localization sites of input protein sequences. In this review, we mainly focus on the biological grounds of the prediction methods rather than the algorithmic issues because we believe the former will be more fruitful for future development. Recent advances on the study of protein sorting signals will hopefully be incorporated into future prediction methods. Unfortunately, many of the state-of-the-art methods are published without sufficient objective tests. In fact, a simple test employed in this article shows that the performance of specifically developed predictors is not significantly better than that of a homology search. We suspect that this is a general problem associated with the interpretation of genome sequences, which have evolved through gene duplication and speciation.
Collapse
Affiliation(s)
- Kenichiro Imai
- Computational Biology Research Center, AIST, Tokyo, Japan
| | | |
Collapse
|
39
|
Zakeri P, Moshiri B, Sadeghi M. Prediction of protein submitochondria locations based on data fusion of various features of sequences. J Theor Biol 2010; 269:208-16. [PMID: 21040732 DOI: 10.1016/j.jtbi.2010.10.026] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2010] [Revised: 10/16/2010] [Accepted: 10/22/2010] [Indexed: 01/16/2023]
Abstract
In this study, the predictors are developed for protein submitochondria locations based on various features of sequences. Information about the submitochondria location for a mitochondria protein can provide much better understanding about its function. We use ten representative models of protein samples such as pseudo amino acid composition, dipeptide composition, functional domain composition, the combining discrete model based on prediction of solvent accessibility and secondary structure elements, the discrete model of pairwise sequence similarity, etc. We construct a predictor based on support vector machines (SVMs) for each representative model. The overall prediction accuracy by the leave-one-out cross validation test obtained by the predictor which is based on the discrete model of pairwise sequence similarity is 1% better than the best computational system that exists for this problem. Moreover, we develop a method based on ordered weighted averaging (OWA) which is one of the fusion data operators. Therefore, OWA is applied on the 11 best SVM-based classifiers that are constructed based on various features of sequence. This method is called Mito-Loc. The overall leave-one-out cross validation accuracy obtained by Mito-Loc is about 95%. This indicates that our proposed approach (Mito-Loc) is superior to the result of the best existing approach which has already been reported.
Collapse
Affiliation(s)
- Pooya Zakeri
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
| | | | | |
Collapse
|
40
|
Abstract
Background Understanding cellular systems requires the knowledge of a protein's subcellular localization (SCL). Although experimental and predicted data for protein SCL are archived in various databases, SCL prediction remains a non-trivial problem in genome annotation. Current SCL prediction tools use amino-acid sequence features and text mining approaches. A comprehensive analysis of protein SCL in human PPI and metabolic networks for various subcellular compartments is necessary for developing a robust SCL prediction methodology. Results Based on protein-protein interaction (PPI) and metabolite-linked protein interaction (MLPI) networks of proteins, we have compared, contrasted and analysed the statistical properties across different subcellular compartments. We integrated PPI and metabolic datasets with SCL information of human proteins from LOCATE and GOA (Gene Ontology Annotation) and estimated three statistical properties: Chi-square (χ2) test, Paired Localisation Correlation Profile (PLCP) and network topological measures. For the PPI network, Pearson's chi-square test shows that for the same SCL category, twice as many interacting protein pairs are observed than estimated when compared to non-interacting protein pairs (χ2 = 1270.19, P-value < 2.2 × 10-16), whereas for MLPI, metabolite-linked protein pairs having the same SCL are observed 20% more than expected, compared to non-metabolite linked proteins (χ2 = 110.02, P-value < 2.2 x10-16). To address the issue of proteins with multiple SCLs, we have specifically used the PLCP (Pair Localization Correlation Profile) measure. PLCP analysis revealed that protein interactions are majorly restricted to the same SCL, though significant cross-compartment interactions are seen for nuclear proteins. Metabolite-linked protein pairs are restricted to specific compartments such as the mitochondrion (P-value < 6.0e-07), the lysosome (P-value < 4.7e-05) and the Golgi apparatus (P-value < 1.0e-15). These findings indicate that the metabolic network adds value to the information in the PPI network for the localisation process of proteins in human subcellular compartments. Conclusions The MLPI network differs significantly from the PPI network in its SCL distribution. The PPI network shows passive protein interaction, possibly due to its high false positive rate, across different subcellular compartments, which seem to be absent in the MLPI network, as the MLPI network has evolved to maintain high substrate specificity for proteins.
Collapse
Affiliation(s)
- Gaurav Kumar
- ARC Centre of Excellence in Bioinformatics and Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney NSW, Australia.
| | | |
Collapse
|
41
|
Sun C, Zhao XM, Tang W, Chen L. FGsub: Fusarium graminearum protein subcellular localizations predicted from primary structures. BMC SYSTEMS BIOLOGY 2010; 4 Suppl 2:S12. [PMID: 20840726 PMCID: PMC2982686 DOI: 10.1186/1752-0509-4-s2-s12] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
Background The fungal pathogen Fusarium graminearum (telomorph Gibberella zeae) is the causal agent of several destructive crop diseases, where a set of genes usually work in concert to cause diseases to crops. To function appropriately, the F. graminearum proteins inside one cell should be assigned to different compartments, i.e. subcellular localizations. Therefore, the subcellular localizations of F. graminearum proteins can provide insights into protein functions and pathogenic mechanisms of this destructive pathogen fungus. Unfortunately, there are no subcellular localization information for F. graminearum proteins available now. Computational approaches provide an alternative way to predicting F. graminearum protein subcellular localizations due to the expensive and time-consuming biological experiments in lab. Results In this paper, we developed a novel predictor, namely FGsub, to predict F. graminearum protein subcellular localizations from the primary structures. First, a non-redundant fungi data set with subcellular localization annotation is collected from UniProtKB database and used as training set, where the subcellular locations are classified into 10 groups. Subsequently, Support Vector Machine (SVM) is trained on the training set and used to predict F. graminearum protein subcellular localizations for those proteins that do not have significant sequence similarity to those in training set. The performance of SVMs on training set with 10-fold cross-validation demonstrates the efficiency and effectiveness of the proposed method. In addition, for F. graminearum proteins that have significant sequence similarity to those in training set, BLAST is utilized to transfer annotations of homologous proteins to uncharacterized F. graminearum proteins so that the F. graminearum proteins are annotated more comprehensively. Conclusions In this work, we present FGsub to predict F. graminearum protein subcellular localizations in a comprehensive manner. We make four fold contributions to this filed. First, we present a new algorithm to cope with imbalance problem that arises in protein subcellular localization prediction, which can solve imbalance problem and avoid false positive results. Second, we design an ensemble classifier which employs feature selection to further improve prediction accuracy. Third, we use BLAST to complement machine learning based methods, which enlarges our prediction coverage. Last and most important, we predict the subcellular localizations of 12786 F. graminearum proteins, which provide insights into protein functions and pathogenic mechanisms of this destructive pathogen fungus.
Collapse
Affiliation(s)
- Chenglei Sun
- Institute of Systems Biology, Shanghai University, Shanghai, China.
| | | | | | | |
Collapse
|
42
|
Ho H, Milenković T, Memisević V, Aruri J, Przulj N, Ganesan AK. Protein interaction network topology uncovers melanogenesis regulatory network components within functional genomics datasets. BMC SYSTEMS BIOLOGY 2010; 4:84. [PMID: 20550706 PMCID: PMC2904735 DOI: 10.1186/1752-0509-4-84] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/04/2009] [Accepted: 06/15/2010] [Indexed: 12/11/2022]
Abstract
Background RNA-mediated interference (RNAi)-based functional genomics is a systems-level approach to identify novel genes that control biological phenotypes. Existing computational approaches can identify individual genes from RNAi datasets that regulate a given biological process. However, currently available methods cannot identify which RNAi screen "hits" are novel components of well-characterized biological pathways known to regulate the interrogated phenotype. In this study, we describe a method to identify genes from RNAi datasets that are novel components of known biological pathways. We experimentally validate our approach in the context of a recently completed RNAi screen to identify novel regulators of melanogenesis. Results In this study, we utilize a PPI network topology-based approach to identify targets within our RNAi dataset that may be components of known melanogenesis regulatory pathways. Our computational approach identifies a set of screen targets that cluster topologically in a human PPI network with the known pigment regulator Endothelin receptor type B (EDNRB). Validation studies reveal that these genes impact pigment production and EDNRB signaling in pigmented melanoma cells (MNT-1) and normal melanocytes. Conclusions We present an approach that identifies novel components of well-characterized biological pathways from functional genomics datasets that could not have been identified by existing statistical and computational approaches.
Collapse
Affiliation(s)
- Hsiang Ho
- Department of Biological Chemistry, University of California, Irvine, 92697-1700, USA
| | | | | | | | | | | |
Collapse
|
43
|
Lee K, Thorneycroft D, Achuthan P, Hermjakob H, Ideker T. Mapping plant interactomes using literature curated and predicted protein-protein interaction data sets. THE PLANT CELL 2010; 22:997-1005. [PMID: 20371643 PMCID: PMC2879763 DOI: 10.1105/tpc.109.072736] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/18/2023]
Abstract
Most cellular processes are enabled by cohorts of interacting proteins that form dynamic networks within the plant proteome. The study of these networks can provide insight into protein function and provide new avenues for research. This article informs the plant science community of the currently available sources of protein interaction data and discusses how they can be useful to researchers. Using our recently curated IntAct Arabidopsis thaliana protein-protein interaction data set as an example, we discuss potentials and limitations of the plant interactomes generated to date. In addition, we present our efforts to add value to the interaction data by using them to seed a proteome-wide map of predicted protein subcellular locations.
Collapse
Affiliation(s)
- KiYoung Lee
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea.
| | | | | | | | | |
Collapse
|
44
|
Briesemeister S, Rahnenführer J, Kohlbacher O. Going from where to why--interpretable prediction of protein subcellular localization. ACTA ACUST UNITED AC 2010; 26:1232-8. [PMID: 20299325 PMCID: PMC2859129 DOI: 10.1093/bioinformatics/btq115] [Citation(s) in RCA: 118] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Protein subcellular localization is pivotal in understanding a protein's function. Computational prediction of subcellular localization has become a viable alternative to experimental approaches. While current machine learning-based methods yield good prediction accuracy, most of them suffer from two key problems: lack of interpretability and dealing with multiple locations. RESULTS We present YLoc, a novel method for predicting protein subcellular localization that addresses these issues. Due to its simple architecture, YLoc can identify the relevant features of a protein sequence contributing to its subcellular localization, e.g. localization signals or motifs relevant to protein sorting. We present several example applications where YLoc identifies the sequence features responsible for protein localization, and thus reveals not only to which location a protein is transported to, but also why it is transported there. YLoc also provides a confidence estimate for the prediction. Thus, the user can decide what level of error is acceptable for a prediction. Due to a probabilistic approach and the use of several thousands of dual-targeted proteins, YLoc is able to predict multiple locations per protein. YLoc was benchmarked using several independent datasets for protein subcellular localization and performs on par with other state-of-the-art predictors. Disregarding low-confidence predictions, YLoc can achieve prediction accuracies of over 90%. Moreover, we show that YLoc is able to reliably predict multiple locations and outperforms the best predictors in this area. AVAILABILITY www.multiloc.org/YLoc.
Collapse
|
45
|
Mei S, Fei W. Amino acid classification based spectrum kernel fusion for protein subnuclear localization. BMC Bioinformatics 2010; 11 Suppl 1:S17. [PMID: 20122188 PMCID: PMC3009488 DOI: 10.1186/1471-2105-11-s1-s17] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Prediction of protein localization in subnuclear organelles is more challenging than general protein subcelluar localization. There are only three computational models for protein subnuclear localization thus far, to the best of our knowledge. Two models were based on protein primary sequence only. The first model assumed homogeneous amino acid substitution pattern across all protein sequence residue sites and used BLOSUM62 to encode k-mer of protein sequence. Ensemble of SVM based on different k-mers drew the final conclusion, achieving 50% overall accuracy. The simplified assumption did not exploit protein sequence profile and ignored the fact of heterogeneous amino acid substitution patterns across sites. The second model derived the PsePSSM feature representation from protein sequence by simply averaging the profile PSSM and combined the PseAA feature representation to construct a kNN ensemble classifier Nuc-PLoc, achieving 67.4% overall accuracy. The two models based on protein primary sequence only both achieved relatively poor predictive performance. The third model required that GO annotations be available, thus restricting the model's applicability. METHODS In this paper, we only use the amino acid information of protein sequence without any other information to design a widely-applicable model for protein subnuclear localization. We use K-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physiochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window size. Thus, (I) window expansion can capture more contextual information and cover size-varying motifs; (II) various amino acid classifications can exploit multi-aspect biological information from the protein sequence. Finally, we combine all the spectrum kernels by simple addition into one single kernel called SpectrumKernel+ for protein subnuclear localization. RESULTS We conduct the performance evaluation experiments on two benchmark datasets: Lei and Nuc-PLoc. Experimental results show that SpectrumKernel+ achieves substantial performance improvement against the previous model Nuc-PLoc, with overall accuracy 83.47% against 67.4%; and 71.23% against 50% of Lei SVM Ensemble, against 66.50% of Lei GO SVM Ensemble. CONCLUSION The method SpectrumKernel+ can exploit rich amino acid information of protein sequence by embedding into implicit size-varying motifs the multi-aspect amino acid physiochemical properties captured by amino acid classification approaches. The kernels derived from diverse amino acid classification approaches and different sizes of k-mer are summed together for data integration. Experiments show that the method SpectrumKernel+ significantly outperforms the existing models for protein subnuclear localization.
Collapse
Affiliation(s)
- Suyu Mei
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China
| | - Wang Fei
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China
| |
Collapse
|
46
|
Real-Chicharro A, Ruiz-Mostazo I, Navas-Delgado I, Kerzazi A, Chniber O, Sánchez-Jiménez F, Medina MA, Aldana-Montes JF. Protopia: a protein-protein interaction tool. BMC Bioinformatics 2009; 10 Suppl 12:S17. [PMID: 19828077 PMCID: PMC2762066 DOI: 10.1186/1471-2105-10-s12-s17] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Protein-protein interactions can be considered the basic skeleton for living organism self-organization and homeostasis. Impressive quantities of experimental data are being obtained and computational tools are essential to integrate and to organize this information. This paper presents Protopia, a biological tool that offers a way of searching for proteins and their interactions in different Protein Interaction Web Databases, as a part of a multidisciplinary initiative of our institution for the integration of biological data http://asp.uma.es. RESULTS The tool accesses the different Databases (at present, the free version of Transfac, DIP, Hprd, Int-Act and iHop), and results are expressed with biological protein names or databases codes and can be depicted as a vector or a matrix. They can be represented and handled interactively as an organic graph. Comparison among databases is carried out using the Uniprot codes annotated for each protein. CONCLUSION The tool locates and integrates the current information stored in the aforementioned databases, and redundancies among them are detected. Results are compatible with the most important network analysers, so that they can be compared and analysed by other world-wide known tools and platforms. The visualization possibilities help to attain this goal and they are especially interesting for handling multiple-step or complex networks.
Collapse
Affiliation(s)
- Alejandro Real-Chicharro
- University of Malaga and CIBER de Enfermedades Raras, Department of Molecular Biology and Biochemistry, Faculty of Sciences, Instituto de Salud Carlos III, Spain.
| | | | | | | | | | | | | | | |
Collapse
|
47
|
Briesemeister S, Blum T, Brady S, Lam Y, Kohlbacher O, Shatkay H. SherLoc2: A High-Accuracy Hybrid Method for Predicting Subcellular Localization of Proteins. J Proteome Res 2009; 8:5363-6. [PMID: 19764776 DOI: 10.1021/pr900665y] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Affiliation(s)
- Sebastian Briesemeister
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| | - Torsten Blum
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| | - Scott Brady
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| | - Yin Lam
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| | - Oliver Kohlbacher
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| | - Hagit Shatkay
- Division for Simulation of Biological Systems, Center for Bioinformatics Tübingen, Eberhard-Karls-Universität Tübingen, Germany, and School of Computing, Queen’s University, Kingston, Ontario, Canada
| |
Collapse
|
48
|
Acencio ML, Lemke N. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics 2009; 10:290. [PMID: 19758426 PMCID: PMC2753850 DOI: 10.1186/1471-2105-10-290] [Citation(s) in RCA: 101] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2008] [Accepted: 09/16/2009] [Indexed: 11/21/2022] Open
Abstract
Background The identification of essential genes is important for the understanding of the minimal requirements for cellular life and for practical purposes, such as drug design. However, the experimental techniques for essential genes discovery are labor-intensive and time-consuming. Considering these experimental constraints, a computational approach capable of accurately predicting essential genes would be of great value. We therefore present here a machine learning-based computational approach relying on network topological features, cellular localization and biological process information for prediction of essential genes. Results We constructed a decision tree-based meta-classifier and trained it on datasets with individual and grouped attributes-network topological features, cellular compartments and biological processes-to generate various predictors of essential genes. We showed that the predictors with better performances are those generated by datasets with integrated attributes. Using the predictor with all attributes, i.e., network topological features, cellular compartments and biological processes, we obtained the best predictor of essential genes that was then used to classify yeast genes with unknown essentiality status. Finally, we generated decision trees by training the J48 algorithm on datasets with all network topological features, cellular localization and biological process information to discover cellular rules for essentiality. We found that the number of protein physical interactions, the nuclear localization of proteins and the number of regulating transcription factors are the most important factors determining gene essentiality. Conclusion We were able to demonstrate that network topological features, cellular localization and biological process information are reliable predictors of essential genes. Moreover, by constructing decision trees based on these data, we could discover cellular rules governing essentiality.
Collapse
Affiliation(s)
- Marcio L Acencio
- Department of Physics and Biophysics, São Paulo State University, Botucatu, São Paulo, Brazil.
| | | |
Collapse
|
49
|
Mintz-Oron S, Aharoni A, Ruppin E, Shlomi T. Network-based prediction of metabolic enzymes' subcellular localization. Bioinformatics 2009; 25:i247-52. [PMID: 19477995 PMCID: PMC2687963 DOI: 10.1093/bioinformatics/btp209] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Motivation: Revealing the subcellular localization of proteins within membrane-bound compartments is of a major importance for inferring protein function. Though current high-throughput localization experiments provide valuable data, they are costly and time-consuming, and due to technical difficulties not readily applicable for many Eukaryotes. Physical characteristics of proteins, such as sequence targeting signals and amino acid composition are commonly used to predict subcellular localizations using computational approaches. Recently it was shown that protein–protein interaction (PPI) networks can be used to significantly improve the prediction accuracy of protein subcellular localization. However, as high-throughput PPI data depend on costly high-throughput experiments and are currently available for only a few organisms, the scope of such methods is yet limited. Results: This study presents a novel constraint-based method for predicting subcellular localization of enzymes based on their embedding metabolic network, relying on a parsimony principle of a minimal number of cross-membrane metabolite transporters. In a cross-validation test of predicting known subcellular localization of yeast enzymes, the method is shown to be markedly robust, providing accurate localization predictions even when only 20% of the known enzyme localizations are given as input. It is shown to outperform pathway enrichment-based methods both in terms of prediction accuracy and in its ability to predict the subcellular localization of entire metabolic pathways when no a-priori pathway-specific localization data is available (and hence enrichment methods are bound to fail). With the number of available metabolic networks already reaching more than 600 and growing fast, the new method may significantly contribute to the identification of enzyme localizations in many different organisms. Contact:shira.mintz@weizmann.ac.il; tomersh@cs.technion.ac.il
Collapse
Affiliation(s)
- Shira Mintz-Oron
- Department of Plant Sciences, Weizmann Institute of Science, Rehovot, Israel.
| | | | | | | |
Collapse
|