1
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its function, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by taxonomies of prediction tools that highlight their relevant areas and examples for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can serve as a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
2
Paul D, Sinnarasan VSP, Das R, Sheikh MMR, Venkatesan A. Machine learning approach to predict blood-secretory proteins and potential biomarkers for liver cancer using omics data. J Proteomics 2024; 309:105298. [PMID: 39216516 DOI: 10.1016/j.jprot.2024.105298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 08/22/2024] [Accepted: 08/29/2024] [Indexed: 09/04/2024]
Abstract
Identifying non-invasive blood-based biomarkers is crucial for early detection and monitoring of liver cancer (LC), thereby improving patient outcomes. This study leveraged computational approaches to predict potential blood-based biomarkers for LC. Machine learning (ML) models were developed using selected features from blood-secretory proteins collected from curated databases. The logistic regression (LR) model demonstrated the best performance. Transcriptome analysis across 7 LC cohorts revealed 231 common differentially expressed genes (DEGs). The encoded proteins of these DEGs were compared with the ML dataset, revealing 29 proteins overlapping with the blood-secretory dataset. Among the remaining protein-coding genes, the LR model predicted 29 additional proteins as blood-secretory. As a result, 58 potential blood-secretory proteins were obtained. Among the top 20 genes, 13 common hub genes were identified. Further, area under the receiver operating characteristic curve (ROC AUC) analysis was performed to assess the genes as potential diagnostic blood biomarkers. Six genes, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6, exhibited an AUC value higher than 0.85 and were predicted as blood-secretory. This study highlights the potential of an integrative computational approach for discovering non-invasive blood-based biomarkers in LC, facilitating further validation and clinical translation.
SIGNIFICANCE: Liver cancer is one of the leading causes of premature death worldwide, with its prevalence and mortality rates projected to increase. Although current diagnostic methods are highly sensitive, they are invasive and unsuitable for repeated testing. Blood biomarkers offer a promising non-invasive alternative, but the wide dynamic range of protein concentrations in blood poses experimental challenges. Therefore, utilizing available omics data to develop a diagnostic model could provide a potential solution for accurate diagnosis. This study developed a computational method integrating machine learning and bioinformatics analysis to identify potential blood biomarkers. As a result, the biomarkers ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6 were identified, holding significant promise for improving diagnosis and understanding of liver cancer. The integrated method can be applied to other cancers, offering a possible solution for early detection and improved patient outcomes.
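The ROC AUC screening step mentioned in this abstract can be illustrated with a minimal sketch. The function name `roc_auc` and the toy expression values are assumptions made for illustration only; the abstract does not describe the authors' actual implementation, and only the 0.85 cutoff comes from the text.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical expression values for one gene: 1 = tumour, 0 = normal.
expression = [8.1, 7.4, 6.9, 5.2, 4.8, 5.5, 3.9, 4.1]
is_tumour = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(expression, is_tumour)
passes_cutoff = auc > 0.85  # 0.85 is the threshold quoted in the abstract
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a per-gene check on cohort-sized data; a rank-based version would be preferable for large inputs.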
Affiliation(s)
- Dahrii Paul
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Rajesh Das
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Amouda Venkatesan
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
3
Wang Y, Sun H, Sheng N, He K, Hou W, Zhao Z, Yang Q, Huang L. ESMSec: Prediction of Secreted Proteins in Human Body Fluids Using Protein Language Models and Attention. Int J Mol Sci 2024; 25:6371. [PMID: 38928078 PMCID: PMC11204320 DOI: 10.3390/ijms25126371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 06/02/2024] [Accepted: 06/05/2024] [Indexed: 06/28/2024] Open
Abstract
The secreted proteins of human body fluids have the potential to be used as biomarkers for diseases. These biomarkers can be used for early diagnosis and risk prediction of diseases, so the study of secreted proteins in human body fluids has great application value. In recent years, the deep-learning-based transformer language model has been transferred from the field of natural language processing (NLP) to the field of proteomics, leading to the development of protein language models (PLMs) for protein sequence representation. Here, we propose a deep learning framework called ESM Predict Secreted Proteins (ESMSec) to predict three types of proteins secreted in human body fluids. ESMSec is based on the ESM2 model and an attention architecture. Specifically, the protein sequence data are first fed into the ESM2 model to extract feature information from the last hidden layer, and all the input proteins are encoded into a fixed 1000 × 480 matrix. Second, multi-head attention with a fully connected neural network is employed as the classifier to perform binary classification according to whether the proteins are secreted into each body fluid. Our experiment utilized three human body fluids that are important and ubiquitous sources of markers. Experimental results show that ESMSec achieved average accuracies of 0.8486, 0.8358, and 0.8325 on the testing datasets for plasma, cerebrospinal fluid (CSF), and seminal fluid, respectively, which on average outperforms the state-of-the-art (SOTA) methods. The outstanding performance of ESMSec demonstrates that ESM can improve the prediction performance of the model and has great potential for screening the secretion information of human body fluid proteins.
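To make the attention step in this abstract more concrete, the following is a minimal sketch of single-head scaled dot-product self-attention over per-residue embeddings. It is deliberately simplified: the multi-head attention named in the abstract uses several heads with learned query, key and value projections, whereas this sketch uses identity projections and a tiny 3 × 4 toy matrix in place of the 1000 × 480 ESM2 output. The function names and toy values are assumptions for illustration, not the ESMSec code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x: list of L token vectors, each of dimension d.
    Returns (output, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return output, weights


# Toy stand-in for per-residue embeddings (3 residues, 4 dimensions);
# the abstract's real input is a 1000 x 480 matrix from ESM2.
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
attended, attn = self_attention(embeddings)
```

Each output row is a weighted average of all residue vectors, so a downstream fully connected layer sees every position in the context of the whole sequence.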
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Huiting Sun
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Nan Sheng
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Kai He
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48103, USA
- Wenjv Hou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Ziqi Zhao
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Qixing Yang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
4
Shang J, Tang X, Sun Y. PhaTYP: predicting the lifestyle for bacteriophages using BERT. Brief Bioinform 2023; 24:bbac487. [PMID: 36659812 PMCID: PMC9851330 DOI: 10.1093/bib/bbac487] [Citation(s) in RCA: 63] [Impact Index Per Article: 31.5] [Received: 06/20/2022] [Revised: 10/05/2022] [Accepted: 10/15/2022] [Indexed: 11/24/2022] Open
Abstract
Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding phages' applications in fields such as phage therapy. Because experimental methods for annotating the lifestyle of phages cannot keep pace with the fast accumulation of sequenced phages, computational methods for predicting phages' lifestyles have become an attractive alternative. Despite some promising results, computational lifestyle prediction remains difficult because of the limited known annotations and the sheer number of sequenced phage contigs assembled from metagenomic data. In particular, most of the existing tools cannot precisely predict phages' lifestyles for short contigs. In this work, we develop PhaTYP (Phage TYPe prediction tool) to improve the accuracy of lifestyle prediction on short contigs. We design two different training tasks, a self-supervised task and a fine-tuning task, to overcome lifestyle prediction difficulties. We rigorously tested and compared PhaTYP with four state-of-the-art methods: DeePhage, PHACTS, PhagePred and BACPHLIP. The experimental results show that PhaTYP outperforms all these methods and achieves more stable performance on short contigs. In addition, we demonstrated the utility of PhaTYP for analyzing the phage lifestyle on human neonates' gut data. This application shows that PhaTYP is a useful means for studying phages in metagenomic data and helps extend our understanding of microbial communities.
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China SAR
- Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China SAR
- Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China SAR
5
DenSec: Secreted Protein Prediction in Cerebrospinal Fluid Based on DenseNet and Transformer. MATHEMATICS 2022. [DOI: 10.3390/math10142490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Cerebrospinal fluid (CSF) exists in the surrounding spaces of mammalian central nervous systems (CNS); therefore, there are numerous potential protein biomarkers associated with CNS disease in CSF. Currently, approximately 4300 proteins have been identified in CSF by protein profiling. However, due to diverse modifications, as well as existing technical limits, large-scale protein identification in CSF is still considered a challenge. Inspired by computational methods, this paper proposes a deep learning framework, named DenSec, for secreted protein prediction in CSF. In the first phase of DenSec, all input proteins are encoded as a matrix with a fixed size of 1000 × 20 by calculating a position-specific score matrix (PSSM) of the protein sequences. In the second phase, a dense convolutional network (DenseNet) is adopted to extract features from these PSSMs automatically. After that, a Transformer with a fully connected dense layer acts as the classifier to perform binary classification of whether or not a protein is secreted into CSF. According to the experimental results, DenSec achieves a mean accuracy of 86.00% on the test dataset and outperforms the state-of-the-art methods.
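The fixed 1000 × 20 input size described here implies that proteins of different lengths must be brought to a common shape. The abstract does not say how this is done; the sketch below assumes truncation of long sequences and zero-padding of short ones, which is a common choice but not confirmed by the source. The function name `to_fixed_size` and its parameters are hypothetical.

```python
def to_fixed_size(pssm, max_len=1000, width=20, pad_value=0.0):
    """Truncate or pad a per-residue score matrix to max_len x width.

    Assumption: rows beyond max_len are dropped and missing rows are
    filled with pad_value; the source does not specify either rule.
    """
    for row in pssm:
        if len(row) != width:
            raise ValueError(f"expected {width} columns, got {len(row)}")
    fixed = [list(row) for row in pssm[:max_len]]
    fixed.extend([[pad_value] * width for _ in range(max_len - len(fixed))])
    return fixed


# Toy matrices standing in for real PSSMs of a short and a long protein.
short_protein = [[1.0] * 20 for _ in range(3)]
long_protein = [[2.0] * 20 for _ in range(1200)]

short_fixed = to_fixed_size(short_protein)
long_fixed = to_fixed_size(long_protein)
```

Both results have 1000 rows of 20 values, so they can be stacked into a single batch for a convolutional network.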
6
Shang J, Tang X, Guo R, Sun Y. Accurate identification of bacteriophages from metagenomic data using Transformer. Brief Bioinform 2022; 23:6620872. [PMID: 35769000 PMCID: PMC9294416 DOI: 10.1093/bib/bbac258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 05/22/2022] [Accepted: 06/04/2022] [Indexed: 11/20/2022] Open
Abstract
Motivation: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from various microbiomes, has become a popular means for new phage discovery. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data. Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predict the label for test contigs. We rigorously tested our tool, named PhaMer, on multiple datasets with increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
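The idea of a protein-cluster vocabulary can be sketched as follows: each protein on a contig is replaced by the token of the cluster it belongs to, and the order along the contig supplies the position. The special tokens, cluster labels, and function names below are assumptions for illustration; the abstract does not describe how PhaMer builds its clusters or encodes positions.

```python
def build_vocabulary(cluster_ids):
    """Map each protein-cluster ID to an integer token; 0 = padding, 1 = unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for cid in sorted(set(cluster_ids)):
        vocab[cid] = len(vocab)
    return vocab


def encode_contig(protein_clusters, vocab, max_len):
    """Turn the ordered protein clusters on a contig into (token, position) pairs."""
    tokens = [vocab.get(cid, vocab["<unk>"]) for cid in protein_clusters[:max_len]]
    tokens += [vocab["<pad>"]] * (max_len - len(tokens))
    positions = list(range(max_len))
    return list(zip(tokens, positions))


# Hypothetical cluster labels, listed in genomic order along one contig.
vocab = build_vocabulary(["PC_001", "PC_002", "PC_003"])
encoded = encode_contig(["PC_002", "PC_999", "PC_001"], vocab, max_len=5)
```

Here `PC_999` is not in the vocabulary and maps to the unknown token, and the last two slots are padding, so every contig yields a sequence of the same length for the Transformer.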
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
- Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
- Ruocheng Guo
- School of Data Science, City University of Hong Kong, Hong Kong (SAR), China
- Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
7
Jain S, Dhall A, Patiyal S, Raghava GPS. IL13Pred: A method for predicting immunoregulatory cytokine IL-13 inducing peptides. Comput Biol Med 2022; 143:105297. [PMID: 35152041 DOI: 10.1016/j.compbiomed.2022.105297] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/11/2021] [Revised: 01/23/2022] [Accepted: 01/23/2022] [Indexed: 11/03/2022]
Abstract
BACKGROUND: Interleukin 13 (IL-13) is an immunoregulatory cytokine, primarily released by activated T-helper 2 cells. IL-13 induces the pathogenesis of many allergic diseases, such as airway hyperresponsiveness, glycoprotein hypersecretion, and goblet cell hyperplasia. In addition, IL-13 inhibits tumor immunosurveillance, leading to carcinogenesis. Since elevated IL-13 serum levels are severe in COVID-19 patients, predicting IL-13 inducing peptides or regions in a protein is vital to designing safe protein therapeutics, particularly immunotherapeutics. OBJECTIVE: The present study describes a method to develop, predict, design, and scan IL-13 inducing peptides. METHODS: The dataset comprised 313 experimentally validated IL-13 inducing peptides and 2908 non-inducing Homo sapiens peptides extracted from the immune epitope database (IEDB). A total of 95 key features were selected from the 9165 features originally generated with Pfeature, using a linear support vector classifier with the L1 penalty (SVC-L1). These key features were ranked based on their prediction ability, and the top 10 features were used to build machine learning prediction models. Various machine learning techniques were deployed to develop models for predicting IL-13 inducing peptides. These models were trained, tested, and evaluated using five-fold cross-validation; the best model was evaluated on an independent dataset. RESULTS: Our best model, based on XGBoost, achieves a maximum AUC of 0.83 and 0.80 on the training and independent datasets, respectively. Our analysis indicates that certain SARS-CoV-2 variants are more prone to induce IL-13 in COVID-19 patients. CONCLUSION: The best-performing model was incorporated into a web server and standalone package named 'IL-13Pred' for precise prediction of IL-13 inducing peptides. For large-dataset analysis, a standalone package of IL-13Pred is available from the web server (https://webs.iiitd.edu.in/raghava/il13pred/) and on GitHub: https://github.com/raghavagps/il13pred.
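The five-fold cross-validation named in the METHODS can be sketched with a small index splitter. This is an illustrative, unstratified split with a hypothetical function name and seed; the abstract does not state whether the authors stratified by class, which would matter given the roughly 1:9 class imbalance, so a stratified splitter may be closer to real practice.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 313 inducing + 2908 non-inducing peptides, as quoted in the abstract.
n_peptides = 313 + 2908
splits = list(k_fold_indices(n_peptides, k=5))
```

Each model would be fitted on `train_idx` and scored on `test_idx`, with the five scores averaged; the independent dataset mentioned in the abstract stays outside this loop entirely.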
Affiliation(s)
- Shipra Jain
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.