1
Liu T, Song C, Wang C. NCSP-PLM: An ensemble learning framework for predicting non-classical secreted proteins based on protein language models and deep learning. Mathematical Biosciences and Engineering: MBE 2024; 21:1472-1488. [PMID: 38303473 DOI: 10.3934/mbe.2024063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
Non-classical secreted proteins (NCSPs) refer to a group of proteins that are located in the extracellular environment despite the absence of signal peptides and motifs. They usually play different roles in intercellular communication. Therefore, the accurate prediction of NCSPs is a critical step to understanding in depth their associated secretion mechanisms. Since the experimental recognition of NCSPs is often costly and time-consuming, computational methods are desired. In this study, we proposed an ensemble learning framework, termed NCSP-PLM, for the identification of NCSPs by extracting feature embeddings from pre-trained protein language models (PLMs) as input to several fine-tuned deep learning models. First, we compared the performance of nine PLM embeddings by training three neural networks: Multi-layer perceptron (MLP), attention mechanism and bidirectional long short-term memory network (BiLSTM) and selected the best network model for each PLM embedding. Then, four models were excluded due to their below-average accuracies, and the remaining five models were integrated to perform the prediction of NCSPs based on the weighted voting. Finally, the 5-fold cross validation and the independent test were conducted to evaluate the performance of NCSP-PLM on the benchmark datasets. Based on the same independent dataset, the sensitivity and specificity of NCSP-PLM were 91.18% and 97.06%, respectively. Particularly, the overall accuracy of our model achieved 94.12%, which was 7~16% higher than that of the existing state-of-the-art predictors. It indicated that NCSP-PLM could serve as a useful tool for the annotation of NCSPs.
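The integration step described in the abstract, weighted voting over the retained models, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the example weights, and the 0.5 threshold are assumptions. The confusion-matrix counts (34 positives and 34 negatives) are a hypothetical split, chosen only because it reproduces the reported 91.18%, 97.06% and 94.12%.

```python
from typing import Sequence, Tuple


def weighted_vote(probs: Sequence[float], weights: Sequence[float],
                  threshold: float = 0.5) -> int:
    """Return 1 (NCSP) if the weighted mean probability reaches the threshold."""
    if not probs or len(probs) != len(weights):
        raise ValueError("probs and weights must be non-empty and equally long")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(p * w for p, w in zip(probs, weights)) / total
    return int(score >= threshold)


def sens_spec_acc(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical per-model probabilities and weights for one protein.
probs = [0.9, 0.4, 0.6, 0.3, 0.7]
weights = [0.95, 0.90, 0.92, 0.88, 0.93]
label = weighted_vote(probs, weights)

# Hypothetical counts: 31 of 34 positives and 33 of 34 negatives correct.
sens, spec, acc = sens_spec_acc(tp=31, fn=3, tn=33, fp=1)
print(label, round(sens * 100, 2), round(spec * 100, 2), round(acc * 100, 2))
```

Using validation accuracy as the weight is one common choice; the abstract does not say how the weights were derived.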
Affiliation(s)
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Chen Song
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Chunhua Wang
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China

2
Wang X, Li F, Xu J, Rong J, Webb GI, Ge Z, Li J, Song J. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinform 2022; 23:bbac031. [PMID: 35176756 PMCID: PMC8921646 DOI: 10.1093/bib/bbac031] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/20/2021] [Revised: 01/10/2022] [Accepted: 01/22/2022] [Indexed: 12/15/2022] Open
Abstract
Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. Among these, the non-classical secretion pathway has recently received increasing attention, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed to identify NCSPs, and most of them employ the whole amino acid sequence of NCSPs to construct the model. However, the sequence length of different proteins varies greatly. In addition, not all regions of the protein are equally important, and some local regions are not relevant to secretion. The functional regions of the protein, particularly in the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER achieves performance superior to that of existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequence information and prioritization of candidate proteins for follow-up experimental validation.
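One way to read the hybrid design described in the abstract: one model scores the whole sequence, a second model scores a fixed-length N-terminal window, and the two scores are combined. The sketch below covers only the windowing and a simple score combination. The window length of 50, the padding character, the equal-weight average and both function names are assumptions made for illustration; none of them is taken from ASPIRER.

```python
def n_terminal_window(sequence: str, length: int = 50, pad: str = "X") -> str:
    """Return the first `length` residues, right-padded so all windows align."""
    if length <= 0:
        raise ValueError("length must be positive")
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    seq = sequence.strip().upper()
    return seq[:length].ljust(length, pad)


def combine_scores(whole_seq_score: float, n_term_score: float,
                   weight: float = 0.5) -> float:
    """Blend two model probabilities; `weight` is the share of the first one."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * whole_seq_score + (1.0 - weight) * n_term_score


# Hypothetical short sequence and hypothetical model outputs.
window = n_terminal_window("mktayiakqr", length=15)
final_score = combine_scores(0.8, 0.4)
print(window, final_score)
```

A fixed-length window sidesteps the length variation the abstract mentions, because every input to the second model has the same shape.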
Affiliation(s)
- Xiaoyu Wang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria, Australia
- Jing Xu
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jia Rong
  - Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Geoffrey I Webb
  - Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Zongyuan Ge
  - Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Jian Li
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia

3
Marques da Silva W, Seyffert N, Silva A, Azevedo V. A journey through the Corynebacterium pseudotuberculosis proteome promotes insights into its functional genome. PeerJ 2022; 9:e12456. [PMID: 35036114 PMCID: PMC8710256 DOI: 10.7717/peerj.12456] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/12/2021] [Accepted: 10/18/2021] [Indexed: 11/28/2022] Open
Abstract
Background Corynebacterium pseudotuberculosis is a Gram-positive facultative intracellular pathogen and the etiologic agent of illnesses such as caseous lymphadenitis in small ruminants, mastitis in dairy cattle, ulcerative lymphangitis in equines, and oedematous skin disease in buffalos. With the growing advance in high-throughput technologies, genomic studies have been carried out to explore the molecular basis of its virulence and pathogenicity. However, large-scale functional genomics studies are necessary to complement genomic data and to better understand the molecular basis of a given organism. Here we summarize the MS-based proteomics techniques and bioinformatics tools incorporated in functional genomic studies of C. pseudotuberculosis to discover the different patterns of protein modulation under distinct environmental conditions, as well as antigenic and drug targets. Methodology In this study we performed an extensive search in Web of Science for original and relevant articles related to methods, strategies, technologies, approaches, and bioinformatics tools focused on the functional study of the genome of C. pseudotuberculosis at the protein level. Results Here, we highlight the use of proteomics for understanding several aspects of the physiology and pathogenesis of C. pseudotuberculosis at the protein level. We describe the implementation and use of protocols, strategies, and proteomics approaches to characterize the different subcellular fractions of the proteome of this pathogen. In addition, we discuss the immunoproteomics, immunoinformatics and genetic tools employed to identify targets for immunoassays, drugs, and vaccines against C. pseudotuberculosis infection. Conclusion In this review, we showed that the combination of proteomics and bioinformatics studies is a suitable strategy to elucidate the functional aspects of the C. pseudotuberculosis genome. Together, all information generated from these proteomics studies has expanded our knowledge about factors related to the pathophysiology of this pathogen.
Affiliation(s)
- Wanderson Marques da Silva
  - Institute of Agrobiotechnology and Molecular Biology-(INTA/CONICET), Hurlingham, Buenos Aires, Argentina
- Nubia Seyffert
  - Institute of Health Sciences, Federal University of Bahia, Salvador, Bahia, Brazil
- Artur Silva
  - Laboratory of Genomics and Bioinformatics, Center of Genomics and Systems Biology, Institute of Biological Sciences, Federal University of Para, Belém, Pará, Brazil
- Vasco Azevedo
  - Genetics, Ecology and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil

4
Dai W, Li J, Li Q, Cai J, Su J, Stubenrauch C, Wang J. PncsHub: a platform for annotating and analyzing non-classically secreted proteins in Gram-positive bacteria. Nucleic Acids Res 2022; 50:D848-D857. [PMID: 34551435 PMCID: PMC8728121 DOI: 10.1093/nar/gkab814] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/15/2021] [Revised: 08/30/2021] [Accepted: 09/07/2021] [Indexed: 12/28/2022] Open
Abstract
From industry to food to health, bacteria play an important role in all facets of life. Some of the most important bacteria have been purposely engineered to produce commercial quantities of antibiotics and therapeutics, and non-classical secretion systems are at the forefront of these technologies. Unlike the classical Sec or Tat pathways, non-classically secreted proteins share few common characteristics and use much more diverse secretion pathways for protein transport. Systematically categorizing and investigating the non-classically secreted proteins will enable a deeper understanding of their associated secretion mechanisms and provide a landscape of the Gram-positive secretion pathway distribution. We therefore developed PncsHub (https://pncshub.erc.monash.edu/), the first universal platform for comprehensively annotating and analyzing Gram-positive bacterial non-classically secreted proteins. PncsHub catalogs 4,914 non-classically secreted proteins, which are delicately categorized into 8 subtypes (including the 'unknown' subtype) and annotated with data compiled from up to 26 resources and visualisation tools. It incorporates state-of-the-art predictors to identify new and homologous non-classically secreted proteins and includes three analytical modules to visualise the relationships between known and putative non-classically secreted proteins. As such, PncsHub aims to provide integrated services for investigating, predicting and identifying non-classically secreted proteins to promote hypothesis-driven laboratory-based experiments.
Affiliation(s)
- Wei Dai
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325011, China
- Jiahui Li
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
- Qi Li
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
- Jiasheng Cai
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
- Jianzhong Su
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325011, China
  - School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China
- Christopher Stubenrauch
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia
  - Centre to Impact AMR, Monash University, VIC 3800, Australia
- Jiawei Wang
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia
  - Centre to Impact AMR, Monash University, VIC 3800, Australia

5
Wang W, Ye LF, Bao H, Hu MT, Han M, Tang HM, Ren C, Wu X, Shao Y, Wang FH, Zhou ZW, Li YH, Xu RH, Wang DS. Heterogeneity and evolution of tumour immune microenvironment in metastatic gastroesophageal adenocarcinoma. Gastric Cancer 2022; 25:1017-1030. [PMID: 35904677 PMCID: PMC9587966 DOI: 10.1007/s10120-022-01324-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2022] [Accepted: 07/16/2022] [Indexed: 02/07/2023]
Abstract
BACKGROUND Tumour immune microenvironment heterogeneity is prevalent in numerous cancers and can negatively impact immunotherapy response. Immune heterogeneity and evolution in gastroesophageal adenocarcinoma (GEA) have not been studied in the past. METHODS Together with a multi-region sampling of normal, primary and metastatic tissues, we performed whole exome sequencing, TCR sequencing as well as immune cell infiltration estimation through deconvolution of gene expression signals. RESULTS We discovered high TCR repertoire and immune cell infiltration heterogeneity among metastatic sites, while they were homogeneous among primary and normal samples. Metastatic sites shared high levels of abundant TCR clonotypes with blood, indicating immune surveillance via blood. Metastatic sites also had low levels of tumour-eliminating immune cells and were undergoing heavy immunomodulation compared to normal and primary tumour tissues. There was co-evolution of neo-antigen and TCR repertoire, but only in patients with late diverging mutational evolution. Co-evolution of TCR repertoire and immune cell infiltration was seen in all except one patient. CONCLUSIONS Our findings revealed immune heterogeneity and co-evolution in GEA, which may inform immunotherapy decision-making.
Affiliation(s)
- Wei Wang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Department of Gastric Surgery, Sun Yat-Sen University Cancer Center, Guangzhou 510060, People’s Republic of China
- Liu-Fang Ye
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- Hua Bao
  - Geneseeq Research Institute, Nanjing Geneseeq Technology Inc., Nanjing, Jiangsu, China
- Ming-Tao Hu
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- Ming Han
  - Geneseeq Research Institute, Nanjing Geneseeq Technology Inc., Nanjing, Jiangsu, China
- Hai-Meng Tang
  - Geneseeq Research Institute, Nanjing Geneseeq Technology Inc., Nanjing, Jiangsu, China
- Chao Ren
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- Xue Wu
  - Geneseeq Research Institute, Nanjing Geneseeq Technology Inc., Nanjing, Jiangsu, China
- Yang Shao
  - Geneseeq Research Institute, Nanjing Geneseeq Technology Inc., Nanjing, Jiangsu, China
  - School of Public Health, Nanjing Medical University, Nanjing, China
- Feng-Hua Wang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- Zhi-Wei Zhou
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Department of Gastric Surgery, Sun Yat-Sen University Cancer Center, Guangzhou 510060, People’s Republic of China
- Yu-Hong Li
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- Rui-Hua Xu
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China
- De-Shen Wang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-Sen University Cancer Center, Sun Yat-Sen University, Guangzhou 510060, People’s Republic of China
  - Research Unit of Precision Diagnosis and Treatment for Gastrointestinal Cancer, Chinese Academy of Medical Sciences, Guangzhou 510060, People’s Republic of China
  - Department of Medical Oncology, Sun Yat-Sen University Cancer Center, 651 Dongfeng East Road, Guangzhou 510060, People’s Republic of China

6
Abstract
Secreted proteins play important roles in several biological processes such as growth, proliferation, differentiation, cell-cell communication, migration, and apoptosis; moreover, these extracellular molecules mediate homeostasis by influencing cross-talk within the surrounding tissues. The cell secretome has become a research area of great interest, since the profiling of secreted proteins could be essential for biomarker discovery and for the identification of new therapeutic strategies. Several bioinformatic platforms have been implemented for the in silico characterization of secreted proteins: this chapter describes a typical workflow for the analysis of proteins secreted by cultured cells through bioinformatic approaches. A central issue is the discrimination between proteins secreted by classical and non-classical pathways. Therefore, specific prediction tools for the classification of candidate secreted proteins are presented here.
7
Wang C, Wu J, Xu L, Zou Q. NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data. Microb Genom 2020; 6:mgen000483. [PMID: 33245691 PMCID: PMC8116686 DOI: 10.1099/mgen.0.000483] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Accepted: 11/06/2020] [Indexed: 01/01/2023] Open
Abstract
Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of their biological functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, the performance of existing methods developed for NCSP identification has been unsatisfactory, and these methods suffer in particular from data deficiency and possible overfitting. Further improvement is desirable, especially to address the lack of informative features and to mine subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by the data imbalance problem, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each of the training subdatasets. Third, the subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred, which achieved excellent performance, with an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve value of 0.9975 for ten-fold cross-validation. Based on assessment on the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
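The first step described in the abstract, building several balanced subdatasets from an imbalanced one and combining the per-subset models, can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: random undersampling of the majority class, positives being the minority class, the simple majority vote and all function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def balanced_subsets(minority: Sequence[str], majority: Sequence[str],
                     n_subsets: int = 10,
                     seed: int = 0) -> List[Tuple[List[str], List[int]]]:
    """Pair all minority samples with an equally sized random majority draw."""
    if len(minority) > len(majority):
        raise ValueError("minority must not be larger than majority")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(list(majority), len(minority))
        samples = list(minority) + sampled
        labels = [1] * len(minority) + [0] * len(sampled)  # 1 = minority class
        subsets.append((samples, labels))
    return subsets


def majority_vote(predictions: Sequence[int]) -> int:
    """Combine 0/1 predictions from the per-subset models; ties go to 1."""
    return int(2 * sum(predictions) >= len(predictions))


# Hypothetical identifiers standing in for protein sequences.
positives = ["p1", "p2", "p3"]
negatives = [f"n{i}" for i in range(1, 11)]
subsets = balanced_subsets(positives, negatives, n_subsets=10)
print(len(subsets), subsets[0][1])
```

Each subset would then train its own model, and `majority_vote` stands in for whatever integration rule the unified model actually uses.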
Affiliation(s)
- Chao Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, PR China
- Jin Wu
  - School of Management, Shenzhen Polytechnic, Shenzhen, PR China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, PR China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, PR China
  - Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou, PR China

8
Zhang Y, Yu S, Xie R, Li J, Leier A, Marquez-Lago TT, Akutsu T, Smith AI, Ge Z, Wang J, Lithgow T, Song J. PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins. Bioinformatics 2020; 36:704-712. [PMID: 31393553 DOI: 10.1093/bioinformatics/btz629] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 04/03/2019] [Revised: 07/17/2019] [Accepted: 08/07/2019] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. 
All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved superior performance, with an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
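For readers comparing the reported figures, the F-value and the Matthews correlation coefficient can both be computed from confusion-matrix counts. The sketch below assumes that "F-value" means the F1 score, which the abstract does not state. The function name and the example counts are hypothetical and do not come from the paper.

```python
import math
from typing import Tuple


def f1_and_mcc(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1, MCC); either is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
f1, mcc = f1_and_mcc(tp=90, fp=10, tn=90, fn=10)
print(round(f1, 3), round(mcc, 3))
```

MCC uses all four cells of the confusion matrix, which is why predictors for imbalanced data often report it next to accuracy.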
Affiliation(s)
- Yanju Zhang
  - Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
- Sha Yu
  - Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
- Ruopeng Xie
  - Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
- Jiahui Li
  - Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- André Leier
  - Department of Genetics, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Tatiana T Marquez-Lago
  - Department of Genetics, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
- A Ian Smith
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
- Zongyuan Ge
  - Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Jiawei Wang
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Trevor Lithgow
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia

9
Zhang J, Zhang Y, Ma Z. In silico Prediction of Human Secretory Proteins in Plasma Based on Discrete Firefly Optimization and Application to Cancer Biomarkers Identification. Front Genet 2019; 10:542. [PMID: 31244885 PMCID: PMC6563772 DOI: 10.3389/fgene.2019.00542] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/04/2019] [Accepted: 05/21/2019] [Indexed: 12/20/2022] Open
Abstract
Early control and prevention of cancer contribute to effective interventions and cancer therapies. Secretory proteins, among the richest sources of biomarkers, have proved important as molecular signposts of the physiological state of a cell. In this work, we aim to propose a proteomic high-throughput technology platform to facilitate the early detection of cancer by means of biomarkers that are secreted into the bloodstream. We compile a new benchmark dataset of human secretory proteins in plasma. A series of sequence-derived features, which have been shown to be involved in the structure and function of the secretory proteins, are collected to mathematically encode these proteins. Considering the influence of potentially irrelevant or redundant features, we introduce a discrete firefly optimization algorithm to perform feature selection. We evaluate and compare the proposed method SCRIP (Secretory proteins in plasma) with state-of-the-art approaches on benchmark datasets and independent testing datasets. SCRIP achieves average AUC values of 0.876 and 0.844 in five-fold cross-validation and the independent test, respectively. In addition, we test SCRIP on proteins in four types of cancer tissues and successfully detect 66∼77% of potential cancer biomarkers.
Affiliation(s)
- Jian Zhang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
  - Henan Key Laboratory of Education Big Data Analysis and Application, Xinyang, China
- Yu Zhang
  - Information Engineering College, Huanghuai University, Zhumadian, China
  - Henan Key Laboratory of Smart Lighting, Zhumadian, China
- Zhiqiang Ma
  - Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun, China

10
|
Zhang J, Chai H, Guo S, Guo H, Li Y. High-Throughput Identification of Mammalian Secreted Proteins Using Species-Specific Scheme and Application to Human Proteome. Molecules 2018; 23:1448. [PMID: 29903999 PMCID: PMC6099666 DOI: 10.3390/molecules23061448] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/13/2018] [Revised: 05/29/2018] [Accepted: 05/30/2018] [Indexed: 02/02/2023]
Abstract
Secreted proteins are widespread in living organisms and cells. Because secreted proteins are easy to detect in body fluids, urine, and saliva in clinical diagnosis, they play important roles as biomarkers for disease diagnosis and in vaccine production. In this study, we propose a novel predictor for accurate high-throughput identification of mammalian secreted proteins that is based on sequence-derived features. We combine the features of amino acid composition, sequence motifs, and physicochemical properties to encode collected proteins. Detailed feature analyses demonstrate the effectiveness of the considered features. Based on the differences across various species of secreted proteins, we introduce the species-specific scheme, which is expected to further explore the intrinsic attributes of specific secreted proteins. Experiments on benchmark datasets demonstrate the effectiveness of our proposed method. A test on an independent testing dataset also indicates good generalization capability. In comparison with the traditional universal model, we experimentally demonstrate that the species-specific scheme is capable of significantly improving the prediction performance. We use our method to make predictions on the unreviewed human proteome and find 272 potential secreted proteins with probabilities higher than 99%. A user-friendly web server, named iMSPs (identification of Mammalian Secreted Proteins), which implements our proposed method, is available for free for academic use at: http://www.inforstation.com/webservers/iMSP/.
Affiliation(s)
- Jian Zhang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
- Haiting Chai
  - College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK.
- Song Guo
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
- Huaping Guo
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
- Yanling Li
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
|
11
|
Lonsdale A, Davis MJ, Doblin MS, Bacic A. Better Than Nothing? Limitations of the Prediction Tool SecretomeP in the Search for Leaderless Secretory Proteins (LSPs) in Plants. FRONTIERS IN PLANT SCIENCE 2016; 7:1451. [PMID: 27729919 PMCID: PMC5037178 DOI: 10.3389/fpls.2016.01451] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 07/26/2016] [Accepted: 09/12/2016] [Indexed: 05/14/2023]
Abstract
In proteomic analyses of the plant secretome, the presence of putative leaderless secretory proteins (LSPs) is difficult to confirm due to the possibility of contamination from other sub-cellular compartments. In the absence of a plant-specific tool for predicting LSPs, the mammalian-trained SecretomeP has been applied to plant proteins in multiple studies to identify the most likely LSPs. This study investigates the effectiveness of using SecretomeP on plant proteins, identifies its limitations and provides a benchmark for its use. In the absence of experimentally verified LSPs, we exploit the common-feature hypothesis behind SecretomeP and use known classically secreted proteins (CSPs) of plants as a proxy to evaluate its accuracy. We show that, contrary to the common-feature hypothesis, plant CSPs are a poor proxy for evaluating LSP detection due to variation in the SecretomeP prediction scores when the signal peptide (SP) is modified. Removing the SP region from CSPs and comparing the predictive performance against non-secretory proteins indicates that commonly used threshold scores of 0.5 and 0.6 result in false-positive rates in excess of 0.3 when applied to plant proteins. Setting the false-positive rate to 0.05, consistent with the original mammalian performance of SecretomeP, yields only a marginally higher true positive rate compared to false positives. Therefore, the use of SecretomeP on plant proteins is not recommended. This study investigates the trade-offs of using SecretomeP on plant proteins and provides insights into predictive features for future development of plant-specific common-feature tools.
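The threshold trade-off described in this abstract can be illustrated with a minimal sketch. It assumes that a protein is called secreted when its score is at or above the threshold; the function name and all scores below are hypothetical and are not taken from the study or from SecretomeP itself.

```python
def rates_at_threshold(pos_scores, neg_scores, threshold):
    """Return (true-positive rate, false-positive rate) at a score threshold.

    Assumption: a score >= threshold counts as a "secreted" call.
    """
    true_pos = sum(score >= threshold for score in pos_scores)
    false_pos = sum(score >= threshold for score in neg_scores)
    return true_pos / len(pos_scores), false_pos / len(neg_scores)


# Hypothetical prediction scores, for illustration only.
positives = [0.9, 0.8, 0.7, 0.55, 0.4]  # proteins known to be secreted
negatives = [0.65, 0.6, 0.5, 0.45, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1]  # non-secretory

tpr, fpr = rates_at_threshold(positives, negatives, 0.5)
print(f"threshold 0.5: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Raising the threshold lowers the false-positive rate but also discards true positives, which is the trade-off the abstract reports for plant proteins.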
Affiliation(s)
- Andrew Lonsdale
  - ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
- Melissa J. Davis
  - The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, VIC, Australia
- Monika S. Doblin
  - ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
- Antony Bacic
  - ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
  - *Correspondence: Antony Bacic
|
12
|
Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014; 2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023]
Abstract
The rapid and reliable identification of promoter regions is important now that the number of genomes to be sequenced is increasing rapidly. Various methods have been developed, but few investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method (named PromHD) based on if-then rules for promoter prediction in human and Drosophila species. PromHD utilizes an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules by using the decision tree mechanism for promoter prediction. The top-ranked informative rules with high certainty grades reveal that the global sequence descriptor, the length of nucleotide A at the first position of the sequence, and two physicochemical properties, absorption maxima and molecular weight, are effective in distinguishing promoters from non-promoters in human and Drosophila species, respectively.
Affiliation(s)
- Wen-Lin Huang
  - Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli 351, Taiwan
- Chun-Wei Tung
  - School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Chyn Liaw
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Hui-Ling Huang
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
- Shinn-Ying Ho
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
|
13
|
Caccia D, Dugo M, Callari M, Bongarzone I. Bioinformatics tools for secretome analysis. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2013; 1834:2442-53. [PMID: 23395702 DOI: 10.1016/j.bbapap.2013.01.039] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Received: 10/29/2012] [Revised: 01/23/2013] [Accepted: 01/29/2013] [Indexed: 12/29/2022]
Abstract
Over recent years, analyses of secretomes (complete sets of secreted proteins) have been reported in various organisms, cell types, and pathologies, and such studies are quickly gaining popularity. Fungi secrete enzymes that can break down potential food sources; plant secreted proteins are primarily parts of the cell wall proteome; and human secreted proteins are involved in cellular immunity and communication, and provide useful information for the discovery of novel biomarkers, such as for cancer diagnosis. Continuous development of methodologies supports the wide identification and quantification of secreted proteins in a given cellular state. The role of secreted factors is also investigated in the context of the regulation of major signaling events, and connectivity maps are built to describe the differential expression and dynamic changes of secretomes. Bioinformatics has become the bridge between secretome data and computational tasks for managing, mining, and retrieving information. Predictions can be made based on this information, contributing to the elucidation of a given organism's physiological state and the determination of the specific malfunction in disease states. Here we provide an overview of the available bioinformatics databases and software that are used to analyze the biological meaning of secretome data, including descriptions of the main functions and limitations of these tools. The important challenges of data analysis are mainly related to the integration of biological information from dissimilar sources. Improvements in databases and developments in software will likely substantially contribute to the usefulness and reliability of secretome studies. This article is part of a Special Issue entitled: An Updated Secretome.
Affiliation(s)
- Dario Caccia
  - Proteomics Laboratory, Department of Experimental Oncology and Molecular Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
|
14
|
Ye L, Zhang T, Wang T, Fang Z. Microbial structures, functions, and metabolic pathways in wastewater treatment bioreactors revealed using high-throughput sequencing. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2012; 46:13244-52. [PMID: 23151157 DOI: 10.1021/es303454k] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 05/13/2023]
Abstract
The objective of this study was to explore microbial community structures, functional profiles, and metabolic pathways in a lab-scale and a full-scale wastewater treatment bioreactor. To do this, over 12 gigabases of metagenomic sequence data and 600,000 paired-end sequences of the bacterial 16S rRNA gene were generated with the Illumina HiSeq 2000 platform, using DNA extracted from activated sludge in the two bioreactors. Three kinds of sequences (16S rRNA gene amplicons, 16S rRNA gene sequences obtained from metagenomic sequencing, and predicted proteins) were used to conduct taxonomic assignments. Specifically, relative abundances of ammonia-oxidizing archaea (AOA) and ammonia-oxidizing bacteria (AOB) were analyzed. Compared with quantitative real-time PCR (qPCR), metagenomic sequencing was demonstrated to be a better approach to quantify AOA and AOB in activated sludge samples. It was found that AOB were more abundant than AOA in both reactors. Furthermore, the analysis of the metabolic profiles indicated that the overall patterns of metabolic pathways in the two reactors were quite similar (73.3% of functions shared). However, for some pathways (such as carbohydrate metabolism and membrane transport), the two reactors differed in the number of pathway-specific genes.
Affiliation(s)
- Lin Ye
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Pokfulam Road, Hong Kong
|
15
|
Santos AR, Carneiro A, Gala-García A, Pinto A, Barh D, Barbosa E, Aburjaile F, Dorella F, Rocha F, Guimarães L, Zurita-Turk M, Ramos R, Almeida S, Soares S, Pereira U, Abreu VC, Silva A, Miyoshi A, Azevedo V. The Corynebacterium pseudotuberculosis in silico predicted pan-exoproteome. BMC Genomics 2012; 13 Suppl 5:S6. [PMID: 23095951 PMCID: PMC3476999 DOI: 10.1186/1471-2164-13-s5-s6] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
Abstract
Background: Pan-genomic studies aim, for instance, at defining the core, dispensable and unique genes within a species. A pan-genomics study for vaccine design tries to assess the best candidates for a vaccine against a specific pathogen. In this context, rather than studying genes predicted to be exported in a single genome, pan-genomics makes it possible to study genes present in different strains within the same species, such as virulence factors. The target organism of this pan-genomic work is Corynebacterium pseudotuberculosis, the etiologic agent of caseous lymphadenitis (CLA) in goats and sheep, which causes significant economic losses in herds around the world. Currently, only a few antigens against CLA are known, and they form the basis of commercial but still ineffective vaccines. In this regard, the present work analyses, in silico, five C. pseudotuberculosis genomes and gathers data to predict common exported proteins in all five genomes. These candidates were also compared to two recent C. pseudotuberculosis in vitro exoproteome results.
Results: The complete genomes of five C. pseudotuberculosis strains (1002, C231, I19, FRC41 and PAT10) were submitted to pan-genomics analysis, yielding sets of 306, 59 and 12 genes, respectively, representing the core, dispensable and unique in silico predicted exported pan-genomes. These sets bear 150 genes classified as secreted (SEC) and 227 as potentially surface exposed (PSE). Our findings suggest that the main C. pseudotuberculosis in vitro exoproteome could be greater, appended by a fraction of the 35 proteins formerly predicted as making part of the variant in vitro exoproteome. These genomes were manually curated for correct methionine initiation and redeposited with a total of 1885 homogenized genes.
Conclusions: The in silico prediction of exported proteins has made it possible to define a list of putative vaccine candidate genes present in all five complete C. pseudotuberculosis genomes. Moreover, it has also been possible to define the in silico predicted dispensable and unique C. pseudotuberculosis exported proteins. These results provide in silico evidence to further guide experiments in the areas of vaccines, diagnosis and drugs. The work presented here is the first whole C. pseudotuberculosis in silico predicted pan-exoproteome completed to date.
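The core/dispensable/unique partition named in this abstract can be sketched as a simple set computation. This is an illustrative sketch only, assuming each strain is reduced to a set of gene identifiers that are already comparable across strains; the function name and the gene identifiers are hypothetical, and the sketch does not reproduce the authors' pipeline.

```python
from collections import Counter


def partition_pan_genome(strain_genes):
    """Split genes into (core, dispensable, unique) sets.

    core: present in every strain; unique: present in exactly one strain;
    dispensable: present in more than one strain but not in all of them.
    Intended for two or more strains.
    """
    counts = Counter(gene for genes in strain_genes.values() for gene in set(genes))
    n_strains = len(strain_genes)
    core = {gene for gene, n in counts.items() if n == n_strains}
    unique = {gene for gene, n in counts.items() if n == 1}
    dispensable = set(counts) - core - unique
    return core, dispensable, unique


# Hypothetical gene identifiers; strain names borrowed from the abstract.
strains = {
    "1002": {"g1", "g2", "g3"},
    "C231": {"g1", "g2"},
    "I19": {"g1", "g4"},
}
core, dispensable, unique = partition_pan_genome(strains)
print(sorted(core), sorted(dispensable), sorted(unique))
```

In practice, deciding that two genes from different strains are "the same" requires an orthology or homology step that this sketch leaves out.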
Affiliation(s)
- Anderson R Santos
  - Molecular and Cellular Genetics Laboratory, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
16
|
Hu Y, Li T, Sun J, Tang S, Xiong W, Li D, Chen G, Cong P. Predicting Gram-positive bacterial protein subcellular localization based on localization motifs. J Theor Biol 2012; 308:135-40. [DOI: 10.1016/j.jtbi.2012.05.031] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/04/2012] [Revised: 03/30/2012] [Accepted: 05/29/2012] [Indexed: 10/28/2022]
|
17
|
Renier S, Micheau P, Talon R, Hébraud M, Desvaux M. Subcellular localization of extracytoplasmic proteins in monoderm bacteria: rational secretomics-based strategy for genomic and proteomic analyses. PLoS One 2012; 7:e42982. [PMID: 22912771 PMCID: PMC3415414 DOI: 10.1371/journal.pone.0042982] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 02/10/2012] [Accepted: 07/13/2012] [Indexed: 11/20/2022]
Abstract
Genome-scale prediction of subcellular localization (SCL) is useful not only for inferring protein function but also for supporting proteomic data. In line with the secretome concept, a rational and original analytical strategy mimicking the secretion steps that determine ultimate SCL was developed for Gram-positive (monoderm) bacteria. Based on the biology of protein secretion, a flowchart and decision trees were designed considering (i) membrane targeting, (ii) protein secretion systems, (iii) membrane retention, and (iv) cell-wall retention by domains or post-translocational modifications, as well as (v) incorporation into cell-surface supramolecular structures. Using Listeria monocytogenes as a case study, results were compared with known data sets from SCL predictors and experimental proteomics. While in good agreement with experimental extracytoplasmic fractions, the secretomics-based method outperforms other genomic analyses, which were simply not intended to be as inclusive. Compared to all other localization predictors, this method not only supplies a static snapshot of protein SCL but also offers the full picture of the secretion process dynamics: (i) the protein routing is detailed, (ii) the number of distinct SCL and protein categories is comprehensive, (iii) the description of protein type and topology is provided, (iv) the SCL is unambiguously differentiated from the protein category, and (v) the multiple SCL and protein categories are fully considered. In that sense, the secretomics-based method is much more than an SCL predictor. Besides being a major step forward in genomics and proteomics of protein secretion, the secretomics-based method appears to be a strategy of choice for generating in silico hypotheses for experimental testing.
Affiliation(s)
- Sandra Renier
  - INRA, UR454 Microbiology, Saint-Genès Champanelle, France
- Pierre Micheau
  - INRA, UR454 Microbiology, Saint-Genès Champanelle, France
- Régine Talon
  - INRA, UR454 Microbiology, Saint-Genès Champanelle, France
- Michel Hébraud
  - INRA, UR454 Microbiology, Saint-Genès Champanelle, France
- Mickaël Desvaux
  - INRA, UR454 Microbiology, Saint-Genès Champanelle, France
|
18
|
Huang WL. Ranking Gene Ontology terms for predicting non-classical secretory proteins in eukaryotes and prokaryotes. J Theor Biol 2012; 312:105-13. [PMID: 22967952 DOI: 10.1016/j.jtbi.2012.07.027] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 01/08/2012] [Revised: 05/30/2012] [Accepted: 07/28/2012] [Indexed: 11/24/2022]
Abstract
Protein secretion is an important biological process for both eukaryotes and prokaryotes. Several sequence-based methods rely mainly on various types of complementary features to design accurate classifiers for predicting non-classical secretory proteins. Gene Ontology (GO) terms are increasingly informative in predicting protein functions. However, the number of GO terms used is often very large. For example, 60,020 GO terms are used in the prediction method Euk-mPLoc 2.0 for subcellular localization. This study proposes a novel approach to identify a small set of m top-ranked GO terms, serving as the only type of input feature, to design a support vector machine (SVM) based method, Sec-GO, to predict non-classical secretory proteins in both eukaryotes and prokaryotes. To evaluate the Sec-GO method, two existing methods and their datasets are adopted for performance comparisons. The Sec-GO method using m=436 GO terms yields an independent test accuracy of 96.7% on mammalian proteins, much better than the existing method SPRED (82.2%), which uses frequencies of tri-peptides and short peptides, secondary structure, and physicochemical properties as input features of a random forest classifier. Furthermore, when applied to Gram-positive bacterial proteins, Sec-GO with m=158 GO terms has a test accuracy of 94.5%, superior to NClassG+ (90.0%), which uses an SVM with several feature types, comprising amino acid composition, di-peptides, physicochemical properties and the position specific weighting matrix. Analysis of the distribution of secretory proteins in a GO database indicates that the percentage of non-classical secretory proteins annotated by GO is larger than that of classical secretory proteins in both eukaryotes and prokaryotes. Of the m top-ranked GO features, the top four GO terms are all annotated by such subcellular locations as GO:0005576 (Extracellular region). Additionally, the Sec-GO method is easily implemented, and its web prediction tool is available at iclab.life.nctu.edu.tw/secgo.
Affiliation(s)
- Wen-Lin Huang
  - Department of Management Information System, Asia Pacific Institute of Creativity, No. 110 XueFu Rd., Tou Fen, Miaoli, Taiwan, ROC.
|