1
Houles T, Yoon SO, Roux PP. The expanding landscape of canonical and non-canonical protein phosphorylation. Trends Biochem Sci 2024:S0968-0004(24)00191-9. [PMID: 39266329 DOI: 10.1016/j.tibs.2024.08.004] [Received: 03/29/2024] [Revised: 08/01/2024] [Accepted: 08/14/2024] [Indexed: 09/14/2024]
Abstract
Protein phosphorylation is a crucial regulatory mechanism in cell signaling, acting as a molecular switch that modulates protein function. Catalyzed by protein kinases and reversed by phosphoprotein phosphatases, it is essential in both normal physiological and pathological states. Recent advances have uncovered a vast and intricate landscape of protein phosphorylation that includes histidine phosphorylation and more unconventional events, such as pyrophosphorylation and polyphosphorylation. Many questions remain about the true size of the phosphoproteome and, more importantly, its site-specific functional relevance. The involvement of unconventional actors such as pseudokinases and pseudophosphatases adds further complexity to be resolved. This review explores recent discoveries and ongoing challenges, highlighting the need for continued research to fully elucidate the roles and regulation of protein phosphorylation.
Affiliation(s)
- Thibault Houles
  - Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, Quebec, Canada; Institute of Molecular Genetics of Montpellier (IGMM), Université de Montpellier, CNRS, Montpellier, France.
- Sang-Oh Yoon
  - Department of Physiology and Biophysics, College of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
- Philippe P Roux
  - Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, Quebec, Canada; Department of Pathology and Cell Biology, Faculty of Medicine, Université de Montréal, Montreal, Quebec, Canada.
2
Ahmed F, Sharma A, Shatabda S, Dehzangi I. DeepPhoPred: Accurate Deep Learning Model to Predict Microbial Phosphorylation. Proteins 2024. [PMID: 39239684 DOI: 10.1002/prot.26734] [Received: 11/29/2023] [Revised: 06/27/2024] [Accepted: 07/15/2024] [Indexed: 09/07/2024]
Abstract
Phosphorylation is a major post-translational modification of proteins in which a phosphate group is added to an amino acid side chain after translation in the ribosome. It is vital for coordinating cellular functions, such as regulating metabolism, proliferation, apoptosis, subcellular trafficking, and other crucial physiological processes. Phosphorylation prediction in a microbial organism can assist in understanding pathogenesis and host-pathogen interaction, drug and antibody design, and antimicrobial agent development. Experimental methods for identifying phosphorylation sites are costly, slow, and tedious. Hence, low-cost and high-speed computational approaches are highly desirable. This paper presents a new deep learning tool called DeepPhoPred for predicting microbial phospho-serine (pS), phospho-threonine (pT), and phospho-tyrosine (pY) sites. DeepPhoPred incorporates a two-headed convolutional neural network architecture with squeeze-and-excitation blocks followed by fully connected layers that jointly learn significant features from the peptide's structural and evolutionary information to predict phosphorylation sites. Our empirical results demonstrate that DeepPhoPred significantly outperforms the existing microbial phosphorylation site predictors with its highly efficient deep-learning architecture. DeepPhoPred as a standalone predictor, all its source code, and our employed datasets are publicly available at https://github.com/faisalahm3d/DeepPhoPred.
Affiliation(s)
- Faisal Ahmed
  - Digital Health Unit, NVISION Systems and Technologies SL, Barcelona, Spain
  - Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, Tarragona, Spain
- Alok Sharma
  - Laboratory of Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Queensland, Australia
  - College of Informatics, Korea University, Seoul, South Korea
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
- Iman Dehzangi
  - Department of Computer Science, Rutgers University, Camden, New Jersey, USA
  - Center for Computational and Integrative Biology (CCIB), Rutgers University, Camden, New Jersey, USA
3
Deng Q, Zhang J, Liu J, Liu Y, Dai Z, Zou X, Li Z. Identifying Protein Phosphorylation Site-Disease Associations Based on Multi-Similarity Fusion and Negative Sample Selection by Convolutional Neural Network. Interdiscip Sci 2024; 16:649-664. [PMID: 38457108 DOI: 10.1007/s12539-024-00615-0] [Received: 08/30/2023] [Revised: 01/26/2024] [Accepted: 01/29/2024] [Indexed: 03/09/2024]
Abstract
As one of the most important post-translational modifications (PTMs), protein phosphorylation plays a key role in a variety of biological processes. Many studies have shown that protein phosphorylation is associated with various human diseases. Therefore, identifying protein phosphorylation site-disease associations can help to elucidate the pathogenesis of disease and discover new drug targets. Networks of sequence similarity and Gaussian interaction profile kernel similarity were constructed for phosphorylation sites, and networks of disease semantic similarity, disease symptom similarity and Gaussian interaction profile kernel similarity were constructed for diseases. To effectively combine the different phosphorylation site and disease similarity information, the random walk with restart algorithm was used to obtain the topology information of each network. Then, the diffusion component analysis method was utilized to obtain the comprehensive phosphorylation site similarity and disease similarity. Meanwhile, reliable negative samples were screened based on the Euclidean distance method. Finally, a convolutional neural network (CNN) model was constructed to identify potential associations between phosphorylation sites and diseases. Based on tenfold cross-validation, the evaluation indicators obtained were an accuracy of 93.48%, specificity of 96.82%, sensitivity of 90.15%, precision of 96.62%, Matthews correlation coefficient of 0.8719, area under the receiver operating characteristic curve of 0.9786 and area under the precision-recall curve of 0.9836. Additionally, most of the top 20 predicted disease-related phosphorylation sites (19/20 for Alzheimer's disease; 16/20 for neuroblastoma) were verified by the literature and databases. These results show that the proposed method has an outstanding prediction performance and a high practical value.
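The evaluation indicators listed in this abstract all derive from the four cells of a binary confusion matrix. As a minimal sketch (the function name and the counts are illustrative assumptions, not values or code from the paper), they could be computed as follows:

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute common binary-classification indicators from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so that none of the simple ratios divides by zero.
    """
    total = tp + tn + fp + fn
    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall on the positive class
        "specificity": tn / (tn + fp),  # recall on the negative class
        "precision": tp / (tp + fp),
        # MCC is conventionally set to 0 when a marginal sum is zero.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Hypothetical counts for a balanced test fold (100 positives, 100 negatives).
metrics = binary_metrics(tp=90, tn=97, fp=3, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells symmetrically, which is why it is often reported alongside sensitivity and specificity when negative samples are selected rather than observed.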
Affiliation(s)
- Qian Deng
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
- Jing Zhang
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
- Jie Liu
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
- Yuqi Liu
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
- Zong Dai
  - School of Biomedical Engineering, Sun Yat-Sen University, Guangzhou, 510275, China
- Xiaoyong Zou
  - School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, China.
- Zhanchao Li
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China.
4
Li W, Lin H, Huang Z, Xie S, Zhou Y, Gong R, Jiang Q, Xiang C, Huang J. DOTAD: A Database of Therapeutic Antibody Developability. Interdiscip Sci 2024; 16:623-634. [PMID: 38530613 DOI: 10.1007/s12539-024-00613-2] [Received: 08/13/2023] [Revised: 01/25/2024] [Accepted: 01/27/2024] [Indexed: 03/28/2024]
Abstract
The development of therapeutic antibodies is an important aspect of new drug discovery pipelines. The assessment of an antibody's developability (its suitability for large-scale production and therapeutic use) is a particularly important step in this process. Given that experimental assays to assess antibody developability at large scale are expensive and time-consuming, computational methods have become a more efficient alternative. However, the antibody research community faces significant challenges due to the scarcity of readily accessible data on antibody developability, which is essential for training and validating computational models. To address this gap, DOTAD (Database Of Therapeutic Antibody Developability) has been built as the first database dedicated exclusively to the curation of therapeutic antibody developability information. DOTAD aggregates all available therapeutic antibody sequence data along with various developability metrics from the scientific literature, offering researchers a robust platform for data storage, retrieval, exploration, and downloading. In addition to serving as a comprehensive repository, DOTAD enhances its utility by integrating a web-based interface that features state-of-the-art tools for the assessment of antibody developability. This ensures that users not only have access to critical data but also have the convenience of analyzing and interpreting this information. The DOTAD database represents a valuable resource for the scientific community, facilitating the advancement of therapeutic antibody research. It is freely accessible at http://i.uestc.edu.cn/DOTAD/, providing an open data platform that supports the continuous growth and evolution of computational methods in the field of antibody development.
Affiliation(s)
- Wenzhen Li
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Hongyan Lin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Ziru Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shiyang Xie
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Yuwei Zhou
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Rong Gong
  - School of Computer Science and Technology, Aba Teachers University, Aba, 623002, China
- Qianhu Jiang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- ChangCheng Xiang
  - School of Computer Science and Technology, Aba Teachers University, Aba, 623002, China.
- Jian Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China.
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu, 611844, China.
5
Grunfeld N, Levine E, Libby E. Experimental measurement and computational prediction of bacterial Hanks-type Ser/Thr signaling system regulatory targets. Mol Microbiol 2024; 122:152-164. [PMID: 38167835 PMCID: PMC11219531 DOI: 10.1111/mmi.15220] [Received: 10/19/2023] [Revised: 12/15/2023] [Accepted: 12/17/2023] [Indexed: 01/05/2024]
Abstract
Bacteria possess diverse classes of signaling systems that they use to sense and respond to their environments and execute properly timed developmental transitions. One widespread and evolutionarily ancient class of signaling systems is the Hanks-type Ser/Thr kinases, also sometimes termed "eukaryotic-like" due to their homology with eukaryotic kinases. In diverse bacterial species, these signaling systems function as critical regulators of general cellular processes such as metabolism, growth and division, developmental transitions such as sporulation, biofilm formation, and virulence, as well as antibiotic tolerance. This multifaceted regulation is due to the ability of a single Hanks-type Ser/Thr kinase to post-translationally modify the activity of multiple proteins, resulting in the coordinated regulation of diverse cellular pathways. However, in part due to their deep integration with cellular physiology, to date we have a relatively limited understanding of the timing, the regulatory hierarchy, and the complete list of targets of a given kinase, as well as the potential regulatory overlap between the often multiple kinases present in a single organism. In this review, we discuss experimental methods and curated datasets aimed at elucidating the targets of these signaling pathways and approaches for using these datasets to develop computational models for quantitative predictions of target motifs. We emphasize novel approaches and opportunities for collecting data suitable for the creation of new predictive computational models applicable to diverse species.
Affiliation(s)
- Noam Grunfeld
  - Department of Bioengineering, Northeastern University, Boston MA USA
- Erel Levine
  - Department of Bioengineering, Northeastern University, Boston MA USA
  - Department of Chemical Engineering, Northeastern University, Boston MA USA
- Elizabeth Libby
  - Department of Bioengineering, Northeastern University, Boston MA USA
6
Yu Z, Yu J, Wang H, Zhang S, Zhao L, Shi S. PhosAF: An integrated deep learning architecture for predicting protein phosphorylation sites with AlphaFold2 predicted structures. Anal Biochem 2024; 690:115510. [PMID: 38513769 DOI: 10.1016/j.ab.2024.115510] [Received: 12/25/2023] [Revised: 03/14/2024] [Accepted: 03/18/2024] [Indexed: 03/23/2024]
Abstract
Phosphorylation is indispensable for understanding biological processes, yet experimental methods for identifying phosphorylation sites are tedious and arduous. With the rapid growth of biotechnology, deep learning methods have made significant progress in site prediction tasks. Nevertheless, most existing predictors only consider protein sequence information, which limits the capture of protein spatial information. Building upon the latest advancement in protein structure prediction by AlphaFold2, a novel integrated deep learning architecture, PhosAF, is developed to predict phosphorylation sites in human proteins by integrating CMA-Net and MFC-Net, which consider sequence information and structure information predicted by AlphaFold2. Here, the CMA-Net module is composed of multiple convolutional neural network layers, and multi-head attention is appended to obtain the local and long-term dependencies of sequence features. Meanwhile, the MFC-Net module, composed of deep neural network layers, is used to capture the complex representations of evolutionary and structure features. The different features are then combined to predict the final phosphorylation sites. In addition, we put forward a new strategy to construct reliable negative samples via protein secondary structures. Experimental results on independent test data and a case study indicate that our model PhosAF surpasses the current most advanced methods in phosphorylation site prediction.
Affiliation(s)
- Ziyuan Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Jialin Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Hongmei Wang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Shuai Zhang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Long Zhao
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China; Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, 330031, China.
7
Gutierrez CS, Kassim AA, Gutierrez BD, Raines RT. Sitetack: A Deep Learning Model that Improves PTM Prediction by Using Known PTMs. bioRxiv 2024:2024.06.03.596298. [PMID: 38895359 PMCID: PMC11185516 DOI: 10.1101/2024.06.03.596298] [Indexed: 06/21/2024]
Abstract
Post-translational modifications (PTMs) increase the diversity of the proteome and are vital to organismal life and therapeutic strategies. Deep learning has been used to predict PTM locations. Still, limitations in datasets and their analyses compromise success. Here we evaluate the use of known PTM sites in prediction via sequence-based deep learning algorithms. Specifically, PTM locations were encoded as a separate amino acid before sequences were encoded via word embedding and passed into a convolutional neural network that predicts the probability of a modification at a given site. Without labeling known PTMs, our model is on par with others. With labeling, however, we improved significantly upon extant models. Moreover, knowing PTM locations can increase the predictability of a different PTM. Our findings highlight the importance of PTMs for the installation of additional PTMs. We anticipate that including known PTM locations will enhance the performance of other proteomic machine learning algorithms.
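The labeling idea in this abstract, treating a known PTM site as its own symbol before the sequence is embedded, can be sketched as follows. This is an illustration only: the lowercase-letter convention for modified residues, the padding symbol, and all function names are assumptions, not the encoding used by Sitetack.

```python
from typing import Dict, List, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for windows that run past the sequence ends


def label_known_ptms(sequence: str, ptm_positions: Set[int]) -> str:
    """Mark residues at known PTM positions (0-based) with a distinct symbol.

    A modified residue is written in lowercase here, so a modified serine
    's' becomes a different token from an unmodified serine 'S'.
    """
    return "".join(
        aa.lower() if i in ptm_positions else aa for i, aa in enumerate(sequence)
    )


def build_vocab() -> Dict[str, int]:
    """Map the padding symbol, 20 residues, and 20 modified residues to integers."""
    symbols = PAD + AMINO_ACIDS + AMINO_ACIDS.lower()
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode_window(labeled: str, center: int, flank: int, vocab: Dict[str, int]) -> List[int]:
    """Integer-encode the window of 2 * flank + 1 symbols centered on `center`."""
    window = [
        labeled[i] if 0 <= i < len(labeled) else PAD
        for i in range(center - flank, center + flank + 1)
    ]
    return [vocab[symbol] for symbol in window]


vocab = build_vocab()
labeled = label_known_ptms("MKSPTSYA", ptm_positions={2})  # serine at index 2 is known to be modified
indices = encode_window(labeled, center=5, flank=3, vocab=vocab)
print(labeled, indices)
```

In a full model, integer indices of this kind would typically feed an embedding layer followed by the convolutional network; omitting the labeling step reduces the input to the plain 20-letter alphabet, which corresponds to the unlabeled baseline the abstract mentions.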
8
Chen Z, Ge R, Wang C, Elazab A, Fu X, Min W, Qin F, Jia G, Fan X. Identification of important gene signatures in schizophrenia through feature fusion and genetic algorithm. Mamm Genome 2024; 35:241-255. [PMID: 38512459 DOI: 10.1007/s00335-024-10034-7] [Received: 11/02/2023] [Accepted: 02/07/2024] [Indexed: 03/23/2024]
Abstract
Schizophrenia is a debilitating psychiatric disorder that can significantly affect a patient's quality of life and lead to permanent brain damage. Although medical research has identified certain genetic risk factors, the specific pathogenesis of the disorder remains unclear. Despite the prevalence of research employing magnetic resonance imaging, few studies have focused on the gene level and gene expression profile involving a large number of screened genes. However, the high dimensionality of genetic data presents a great challenge to accurately modeling the data. To tackle the current challenges, this study presents a novel feature selection strategy that utilizes heuristic feature fusion and a multi-objective optimization genetic algorithm. The goal is to improve classification performance and identify the key gene subset for schizophrenia diagnostics. Traditional gene screening techniques are inadequate for accurately determining the precise number of key genes associated with schizophrenia. Our innovative approach integrates a filter-based feature selection method to reduce data dimensionality and a multi-objective optimization genetic algorithm for improved classification tasks. By combining the filtering and wrapper methods, our strategy leverages their respective strengths in a deliberate manner, leading to superior classification accuracy and a more efficient selection of relevant genes. This approach has demonstrated significant improvements in classification results across 11 out of 14 relevant datasets. The performance on the remaining three datasets is comparable to the existing methods. Furthermore, visual and enrichment analyses have confirmed the practicality of our proposed method as a promising tool for the early detection of schizophrenia.
Affiliation(s)
- Ruiquan Ge
  - Hangzhou Dianzi University, Hangzhou, China.
  - Hangzhou Institute of Advanced Technology, Hangzhou, China.
  - Key Laboratory of Discrete Industrial Internet of Things of Zhejiang Province, Hangzhou, China.
- Changmiao Wang
  - Shenzhen Research Institute of Big Data, Shenzhen, China
- Ahmed Elazab
  - Computer Science Department, Misr Higher Institute for Commerce and Computers, Mansoura, Egypt
- Xianjun Fu
  - School of Artificial Intelligence, Zhejiang College of Security Technology, Wenzhou, China
- Wenwen Min
  - School of Information Science and Engineering, Yunnan University, Kunming, China
- Feiwei Qin
  - Hangzhou Dianzi University, Hangzhou, China
- Xiaopeng Fan
  - Hangzhou Institute of Advanced Technology, Hangzhou, China
9
Ke J, Zhao J, Li H, Yuan L, Dong G, Wang G. Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model. Comput Biol Med 2024; 174:108330. [PMID: 38588617 DOI: 10.1016/j.compbiomed.2024.108330] [Received: 01/29/2024] [Revised: 03/06/2024] [Accepted: 03/17/2024] [Indexed: 04/10/2024]
Abstract
N-terminal acetylation is one of the most common and important post-translational modifications (PTMs) of eukaryotic proteins. PTMs play a crucial role in various cellular processes and disease pathogenesis. Thus, the accurate identification of N-terminal acetylation modifications is important to gain insight into cellular processes and other possible functional mechanisms. Although some algorithmic models have been proposed, most have been developed based on traditional machine learning algorithms and small training datasets, which limits their practical applications. Deep learning models, in contrast, are better at handling high-throughput and complex data. In this study, DeepCBA, a model based on a hybrid framework of convolutional neural network (CNN), bidirectional long short-term memory network (BiLSTM), and attention mechanism, was constructed to detect N-terminal acetylation sites. DeepCBA was built as follows: First, a benchmark dataset was generated by selecting low-redundancy protein sequences from the UniProt database and further reducing the redundancy of the protein sequences using the CD-HIT tool. Subsequently, based on the skip-gram model in the word2vec algorithm, tripeptide word vector features were generated on the benchmark dataset. Finally, the CNN, BiLSTM, and attention mechanism were combined, and the tripeptide word vector features were fed into the stacked model for multiple rounds of training. The model performed excellently on an independent test dataset, with accuracy and area under the curve of 80.51% and 87.36%, respectively. Altogether, DeepCBA achieved superior performance compared with the baseline model, and significantly outperformed most existing predictors. Additionally, our model can be used to identify disease loci and drug targets.
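The tripeptide "words" used for word2vec features in this abstract can be illustrated with a short sketch. It assumes overlapping tripeptides with a stride of one residue and a symmetric context window; both choices and all names are assumptions for illustration, since the abstract does not specify them. Actual vector training would be done with a word2vec implementation.

```python
from typing import List, Tuple


def tripeptide_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a protein sequence into k-residue words (tripeptides when k = 3)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """List the (target, context) pairs a skip-gram model would be trained on."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical sequences standing in for a benchmark dataset.
corpus = [tripeptide_tokens(seq) for seq in ["MSTAGKV", "MADEEK"]]
pairs = skipgram_pairs(corpus[0][:3], window=1)
print(corpus)
print(pairs)
```

Each sequence becomes a "sentence" of tripeptide words, and the skip-gram objective learns one vector per tripeptide by predicting its neighbors; with 20 standard residues there are at most 8000 distinct tripeptides in the vocabulary.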
Affiliation(s)
- Jinsong Ke
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China
- Jianmei Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China; College of Life Science, Northeast Forestry University, Harbin, 150040, China
- Hongfei Li
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China; College of Life Science, Northeast Forestry University, Harbin, 150040, China
- Lei Yuan
  - Department of Hepatobiliary Surgery, Quzhou People's Hospital, Quzhou, 324000, China
- Guanghui Dong
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China.
10
Zahiri Z, Mehrshad N, Mehrshad M. DF-Phos: Prediction of Protein Phosphorylation Sites by Deep Forest. J Biochem 2024; 175:447-456. [PMID: 38153271 DOI: 10.1093/jb/mvad116] [Received: 07/10/2023] [Revised: 12/10/2023] [Accepted: 12/12/2023] [Indexed: 12/29/2023] Open
Abstract
Phosphorylation is the most important and most studied post-translational modification (PTM), and it plays a crucial role in protein function studies and experimental design. Many significant studies have been performed to predict phosphorylation sites using various machine-learning methods. Recently, several studies have claimed that deep learning-based methods are the best way to predict phosphorylation sites, because deep learning, as an advanced machine learning method, can automatically detect complex representations of phosphorylation patterns from raw sequences and thus offers a powerful tool to improve phosphorylation site prediction. In this study, we report DF-Phos, a new phosphosite predictor based on Deep Forest. In DF-Phos, the feature vector obtained from the CkSAApair method serves as input to a Deep Forest framework for predicting phosphorylation sites. The results of 10-fold cross-validation show that the Deep Forest method has the highest performance among the available methods. We implemented a Python program of DF-Phos, which is freely available for non-commercial use at https://github.com/zahiriz/DF-Phos. Moreover, users can use it for various PTM predictions.
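Assuming that CkSAApair refers to the widely used composition of k-spaced amino acid pairs (CKSAAP) encoding, the feature vector mentioned in this abstract can be sketched as below. The function name, the default `k_max`, and the handling of non-standard symbols are illustrative assumptions, not details taken from DF-Phos.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered residue pairs


def cksaap(peptide: str, k_max: int = 2) -> List[float]:
    """Return pair frequencies for gaps k = 0..k_max (400 features per gap).

    For each gap k, residues at positions i and i + k + 1 form a pair; the
    count of each pair is divided by the number of such positions.
    """
    features: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(peptide) - k - 1
        for i in range(max(n_windows, 0)):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:  # ignore padding or non-standard residues
                counts[pair] += 1
        denom = n_windows if n_windows > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features


# Hypothetical 5-residue window; real site windows are usually longer.
vector = cksaap("ASTSA", k_max=1)
print(len(vector), vector[PAIRS.index("AS")])
```

A fixed-length vector of this kind (400 values per gap size) can be passed to any tabular classifier, which is what makes it a natural input for a tree-ensemble method such as Deep Forest.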
Affiliation(s)
- Zeynab Zahiri
  - Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Nasser Mehrshad
  - Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Maliheh Mehrshad
  - Department of Aquatic Sciences and Assessment, Swedish University of Agricultural Sciences, Uppsala, 750 07 Sweden
11
Ertelt M, Mulligan VK, Maguire JB, Lyskov S, Moretti R, Schiffner T, Meiler J, Schoeder CT. Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins. PLoS Comput Biol 2024; 20:e1011939. [PMID: 38484014 DOI: 10.1371/journal.pcbi.1011939] [Received: 06/18/2023] [Revised: 03/26/2024] [Accepted: 02/20/2024] [Indexed: 03/27/2024] Open
Abstract
Post-translational modifications (PTMs) of proteins play a vital role in their function and stability. These modifications influence protein folding, signaling, protein-protein interactions, enzyme activity, binding affinity, aggregation, degradation, and much more. To date, over 400 types of PTMs have been described, representing chemical diversity well beyond the genetically encoded amino acids. Such modifications pose a challenge to the successful design of proteins, but also represent a major opportunity to diversify the protein engineering toolbox. To this end, we first trained artificial neural networks (ANNs) to predict eighteen of the most abundant PTMs, including protein glycosylation, phosphorylation, methylation, and deamidation. In a second step, these models were implemented inside the computational protein modeling suite Rosetta, which allows flexible combination with existing protocols to model the modified sites and understand their impact on protein stability as well as function. Lastly, we developed a new design protocol that either maximizes or minimizes the predicted probability of a particular site being modified. We find that this combination of ANN prediction and structure-based design can enable the modification of existing, as well as the introduction of novel, PTMs. The potential applications of our work include, but are not limited to, glycan masking of epitopes, strengthening protein-protein interactions through phosphorylation, as well as protecting proteins from deamidation liabilities. These applications are especially important for the design of new protein therapeutics where PTMs can drastically change the therapeutic properties of a protein. Our work adds novel tools to Rosetta's protein engineering toolbox that allow for the rational design of PTMs.
Affiliation(s)
- Moritz Ertelt
  - Institute for Drug Discovery, Leipzig University Medical Faculty, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Germany
- Vikram Khipple Mulligan
  - Center for Computational Biology, Flatiron Institute, New York, New York, United States of America
- Jack B Maguire
  - Program in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Sergey Lyskov
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Rocco Moretti
  - Department of Chemistry, Vanderbilt University, Nashville, Tennessee, United States of America
  - Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Torben Schiffner
  - Institute for Drug Discovery, Leipzig University Medical Faculty, Leipzig, Germany
- Jens Meiler
  - Institute for Drug Discovery, Leipzig University Medical Faculty, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Germany
  - Department of Chemistry, Vanderbilt University, Nashville, Tennessee, United States of America
  - Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Clara T Schoeder
  - Institute for Drug Discovery, Leipzig University Medical Faculty, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Germany
12
Zhou Z, Yeung W, Soleymani S, Gravel N, Salcedo M, Li S, Kannan N. Using explainable machine learning to uncover the kinase-substrate interaction landscape. Bioinformatics 2024; 40:btae033. [PMID: 38244571 PMCID: PMC10868336 DOI: 10.1093/bioinformatics/btae033] [Received: 07/27/2023] [Revised: 12/09/2023] [Accepted: 01/17/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION Phosphorylation, a post-translational modification regulated by protein kinase enzymes, plays an essential role in almost all cellular processes. Understanding how each of the nearly 500 human protein kinases selectively phosphorylates their substrates is a foundational challenge in bioinformatics and cell signaling. Although deep learning models have been a popular means to predict kinase-substrate relationships, existing models often lack interpretability and are trained on datasets skewed toward a subset of well-studied kinases. RESULTS Here we leverage recent peptide library datasets generated to determine substrate specificity profiles of 300 serine/threonine kinases to develop an explainable Transformer model for kinase-peptide interaction prediction. The model, trained solely on primary sequences, achieved state-of-the-art performance. Its unique multitask learning paradigm built within the model enables predictions on virtually any kinase-peptide pair, including predictions on 139 kinases not used in peptide library screens. Furthermore, we employed explainable machine learning methods to elucidate the model's inner workings. Through analysis of learned embeddings at different training stages, we demonstrate that the model employs a unique strategy of substrate prediction considering both substrate motif patterns and kinase evolutionary features. SHapley Additive exPlanation (SHAP) analysis reveals key specificity determining residues in the peptide sequence. Finally, we provide a web interface for predicting kinase-substrate associations for user-defined sequences and a resource for visualizing the learned kinase-substrate associations. AVAILABILITY AND IMPLEMENTATION All code and data are available at https://github.com/esbgkannan/Phosformer-ST. Web server is available at https://phosformer.netlify.app.
Affiliation(s)
- Zhongliang Zhou
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Wayland Yeung
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Saber Soleymani
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Mariah Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| | - Sheng Li
- School of Data Science, University of Virginia, Charlottesville, VA 22903, United States
| | - Natarajan Kannan
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| |
|
13
|
Shrestha P, Kandel J, Tayara H, Chong KT. DL-SPhos: Prediction of serine phosphorylation sites using transformer language model. Comput Biol Med 2024; 169:107925. [PMID: 38183701 DOI: 10.1016/j.compbiomed.2024.107925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/21/2023] [Accepted: 01/01/2024] [Indexed: 01/08/2024]
Abstract
Serine phosphorylation plays a pivotal role in the pathogenesis of various cellular processes and diseases. Roughly 81% of human diseases have links to phosphorylation, and an overwhelming 86.4% of protein phosphorylation takes place at serine residues. In eukaryotes, over a quarter of proteins undergo phosphorylation, with more than half implicated in numerous disorders, notably cancer and reproductive system diseases. This study primarily focuses on serine-phosphorylation-driven pathogenesis and the critical role of conserved motif identification. While numerous techniques exist for predicting serine phosphorylation sites, traditional wet lab experiments are resource-intensive. Our paper introduces a cutting-edge deep learning tool for predicting S phosphorylation sites, integrating explainable AI for motif identification, a transformer language model, and deep neural network components. We trained our model on protein sequences from UniProt, validated it against the dbPTM benchmark dataset, and employed the PTMD dataset to explore motifs related to mammalian disorders. Our results highlight that our model surpasses other deep learning predictors by a significant 3%. Furthermore, we utilized the local interpretable model-agnostic explanations (LIME) approach to shed light on the predictions, emphasizing the amino acid residues crucial for S phosphorylation. Notably, our model also outperformed competitors in kinase-specific serine phosphorylation prediction on benchmark datasets.
Affiliation(s)
- Palistha Shrestha
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Jeevan Kandel
- Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| |
|
14
|
Song T, Yang Q, Qu P, Qiao L, Wang X. Attenphos: General Phosphorylation Site Prediction Model Based on Attention Mechanism. Int J Mol Sci 2024; 25:1526. [PMID: 38338804 PMCID: PMC10855885 DOI: 10.3390/ijms25031526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 01/18/2024] [Accepted: 01/23/2024] [Indexed: 02/12/2024] Open
Abstract
Phosphorylation site prediction has important application value in the field of bioinformatics. It can act as an important reference and help with protein function research, protein structure research, and drug discovery. So, it is of great significance to propose scientific and effective calculation methods to accurately predict phosphorylation sites. In this study, we propose a new method, Attenphos, based on the self-attention mechanism for predicting general phosphorylation sites in proteins. The method not only captures the long-range dependence information of proteins but also better represents the correlation between amino acids through feature vector encoding transformation. Attenphos takes advantage of the one-dimensional convolutional layer to reduce the number of model parameters, improve model efficiency and prediction accuracy, and enhance model generalization. Comparisons between our method and existing state-of-the-art prediction tools were made using balanced datasets from human proteins and unbalanced datasets from mouse proteins. We performed prediction comparisons using independent test sets. The results showed that Attenphos demonstrated the best overall performance in the prediction of Serine (S), Threonine (T), and Tyrosine (Y) sites on both balanced and unbalanced datasets. Compared to current state-of-the-art methods, Attenphos has significantly higher prediction accuracy. This proves the potential of Attenphos in accelerating the identification and functional analysis of protein phosphorylation sites and provides new tools and ideas for biological research and drug discovery.
Affiliation(s)
- Xun Wang
- Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao 266555, China; (T.S.); (Q.Y.); (P.Q.); (L.Q.)
| |
|
15
|
Skinnider MA, Akinlaja MO, Foster LJ. Mapping protein states and interactions across the tree of life with co-fractionation mass spectrometry. Nat Commun 2023; 14:8365. [PMID: 38102123 PMCID: PMC10724252 DOI: 10.1038/s41467-023-44139-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2023] [Accepted: 12/01/2023] [Indexed: 12/17/2023] Open
Abstract
We present CFdb, a harmonized resource of interaction proteomics data from 411 co-fractionation mass spectrometry (CF-MS) datasets spanning 21,703 fractions. Meta-analysis of this resource charts protein abundance, phosphorylation, and interactions throughout the tree of life, including a reference map of the human interactome. We show how large-scale CF-MS data can enhance analyses of individual CF-MS datasets, and exemplify this strategy by mapping the honey bee interactome.
Affiliation(s)
- Michael A Skinnider
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Ludwig Institute for Cancer Research, Princeton University, Princeton, NJ, USA
| | - Mopelola O Akinlaja
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada
| | - Leonard J Foster
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada.
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada.
| |
|
16
|
Poretsky E, Andorf CM, Sen TZ. PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models. Plant Direct 2023; 7:e554. [PMID: 38124705 PMCID: PMC10732782 DOI: 10.1002/pld3.554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 11/20/2023] [Accepted: 11/26/2023] [Indexed: 12/23/2023]
Abstract
Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remains limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Trained on data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall score of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence to show that PhosBoost models are transferable across species and scalable for genome-wide protein phosphorylation predictions. PhosBoost is freely and publicly available on GitHub.
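The precision-recall tradeoff mentioned in the abstract above can be illustrated with a minimal sketch. The labels and scores below are hypothetical toy values, not PhosBoost output, and `precision_recall` is an illustrative helper rather than part of any published tool. Lowering the decision threshold calls more sites positive, which tends to raise recall at the cost of precision.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when sites scoring >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)  # true positives
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)  # false positives
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)   # missed phosphosites
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = phosphosite, 0 = non-phosphosite.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.8, 0.5, 0.3, 0.3, 0.1, 0.1]

strict = precision_recall(labels, scores, threshold=0.6)
lenient = precision_recall(labels, scores, threshold=0.25)
print("strict threshold :", strict)
print("lenient threshold:", lenient)
```

In this toy case the lenient threshold recovers more true sites but admits more false positives, which is the kind of tradeoff a recall-oriented predictor accepts.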
Affiliation(s)
- Elly Poretsky
- Agricultural Research Service, Crop Improvement and Genetics Research Unit, U.S. Department of Agriculture, Albany, CA, United States
| | - Carson M. Andorf
- Agricultural Research Service, Corn Insects and Crop Genetics Research, U.S. Department of Agriculture, Ames, IA, United States
- Department of Computer Science, Iowa State University, Ames, IA, United States
| | - Taner Z. Sen
- Agricultural Research Service, Crop Improvement and Genetics Research Unit, U.S. Department of Agriculture, Albany, CA, United States
- Department of Bioengineering, University of California, Berkeley, CA, United States
| |
|
17
|
Esmaili F, Pourmirzaei M, Ramazi S, Shojaeilangari S, Yavari E. A Review of Machine Learning and Algorithmic Methods for Protein Phosphorylation Site Prediction. Genomics, Proteomics & Bioinformatics 2023; 21:1266-1285. [PMID: 37863385 PMCID: PMC11082408 DOI: 10.1016/j.gpb.2023.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/16/2022] [Revised: 01/16/2023] [Accepted: 03/23/2023] [Indexed: 10/22/2023]
Abstract
Post-translational modifications (PTMs) have key roles in extending the functional diversity of proteins and, as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays a significant role in many biological processes. Disorders in the phosphorylation process lead to multiple diseases, including neurological disorders and cancers. The purpose of this review is to organize this body of knowledge associated with phosphorylation site (p-site) prediction to facilitate future research in this field. At first, we comprehensively review all related databases and introduce all steps regarding dataset creation, data preprocessing, and method evaluation in p-site prediction. Next, we investigate p-site prediction methods, which are divided into two computational groups: algorithmic and machine learning (ML). Additionally, it is shown that there are basically two main approaches for p-site prediction by ML: conventional and end-to-end deep learning methods, both of which are given an overview. Moreover, this review introduces the most important feature extraction techniques, which have mostly been used in p-site prediction. Finally, we create three test sets from new proteins related to the released version of the database of protein post-translational modifications (dbPTM) in 2022 based on general and human species. Evaluating online p-site prediction tools on newly added proteins introduced in the dbPTM 2022 release, distinct from those in the dbPTM 2019 release, reveals their limitations. In other words, the actual performance of these online p-site prediction tools on unseen proteins is notably lower than the results reported in their respective research papers.
Affiliation(s)
- Farzaneh Esmaili
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Mahdi Pourmirzaei
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran 14115-111, Iran.
| | - Seyedehsamaneh Shojaeilangari
- Biomedical Engineering Group, Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology (IROST), Tehran 33535-111, Iran
| | - Elham Yavari
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| |
|
18
|
López-Correa JM, König C, Vellido A. GPCR molecular dynamics forecasting using recurrent neural networks. Sci Rep 2023; 13:20995. [PMID: 38017062 PMCID: PMC10684758 DOI: 10.1038/s41598-023-48346-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 11/25/2023] [Indexed: 11/30/2023] Open
Abstract
G protein-coupled receptors (GPCRs) are a large superfamily of cell membrane proteins that play an important physiological role as transmitters of extracellular signals. Signal transmission through the cell membrane depends on conformational changes in the transmembrane region of the receptor, which makes the investigation of the dynamics in these regions particularly relevant. Molecular dynamics (MD) simulations provide a wealth of data about the structure, dynamics, and physiological function of biological macromolecules by modelling the interactions between their atomic constituents. In this study, a Recurrent and Convolutional Neural Network (RNN) model, namely Long Short-Term Memory (LSTM), is used to predict the dynamics of two GPCR states and three specific simulations of each one, through their activation path and focussing on specific receptor regions. Active and inactive states of the GPCRs are analysed in six scenarios involving APO, Full Agonist (BI 167107) and Partial Inverse Agonist (carazolol) of the receptor. Four Machine Learning models with increasing complexity in terms of neural network architecture are evaluated, and their results discussed. The best method achieves an overall RMSD lower than 0.139 Å and the transmembrane helices are the regions showing the minimum prediction errors and minimum relative movements of the protein.
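For readers unfamiliar with the RMSD figure quoted above, the following is a minimal sketch of a root-mean-square deviation between a predicted and a reference set of atomic coordinates. It assumes the two structures are already superimposed and have matching atom order; the abstract does not state how the authors computed or aggregated RMSD, so this is only the generic definition, and the coordinates are hypothetical.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumes both structures share the same atom order and frame of reference
    (no superposition is performed here).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))


# Hypothetical two-atom example (coordinates in angstroms).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shift_rmsd = rmsd(predicted, reference)
print(shift_rmsd)
```

A uniform displacement of every atom by the same distance yields an RMSD equal to that distance, which makes this toy case easy to verify by hand.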
Affiliation(s)
| | - Caroline König
- Universitat Politècnica de Catalunya, Barcelona, Spain
- IDEAI-UPC - Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
| | - Alfredo Vellido
- Universitat Politècnica de Catalunya, Barcelona, Spain.
- IDEAI-UPC - Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain.
| |
|
19
|
Xie J, Quan L, Wang X, Wu H, Jin Z, Pan D, Chen T, Wu T, Lyu Q. DeepMPSF: A Deep Learning Network for Predicting General Protein Phosphorylation Sites Based on Multiple Protein Sequence Features. J Chem Inf Model 2023; 63:7258-7271. [PMID: 37931253 DOI: 10.1021/acs.jcim.3c00996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2023]
Abstract
Phosphorylation, as one of the most important post-translational modifications, plays a key role in various cellular physiological processes and disease occurrences. In recent years, computer technology has been gradually applied to the prediction of protein phosphorylation sites. However, most existing methods rely on simple protein sequence features that provide limited contextual information. To overcome this limitation, we propose DeepMPSF, a phosphorylation site prediction model based on multiple protein sequence features. There are two types of features: sequence semantic features, which comprise protein residue type information and relative position information within protein sequence, and protein background biophysical features, which include global semantic information containing more comprehensive protein background information obtained from pretrained models. To extract these features, DeepMPSF employs two separate subnetworks: the S71SFE module and the BBFE module, which automatically extract high-level semantic features. Our model incorporates a learning strategy for handling imbalanced datasets through ensemble learning during training and prediction. DeepMPSF is trained and evaluated on a well-established dataset of human proteins. Comparing the analysis with other benchmark methods reveals that DeepMPSF outperforms in predicting both S/T residues and Y residues. In particular, DeepMPSF showed excellent generalization performance in cross-species blind test performance, with an average improvement of 5.63%/5.72%, 22.28%/25.94%, 20.11%/17.49%, and 26.40%/28.33% for Mus musculus/Rattus norvegicus test sets in area under curves (AUCs) of ROC curve, AUC of the PR curve, F1-score, and MCC metrics, respectively. Furthermore, it also shows excellent performance in the latest updated case of natural proteins with functional phosphorylation sites. 
Through an ablation study and visual analysis, we uncover that the design of different feature modules significantly contributes to the accurate classification of DeepMPSF, which provides valuable insights for predicting phosphorylation sites and offers effective support for future downstream research.
Affiliation(s)
- Jingxin Xie
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Xuejiao Wang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Hongjie Wu
- Suzhou University of Science and Technology, Suzhou 215006, China
| | - Zhi Jin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Deng Pan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Taoning Chen
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| |
|
20
|
Pham NT, Phan LT, Seo J, Kim Y, Song M, Lee S, Jeon YJ, Manavalan B. Advancing the accuracy of SARS-CoV-2 phosphorylation site detection via meta-learning approach. Brief Bioinform 2023; 25:bbad433. [PMID: 38058187 PMCID: PMC10753650 DOI: 10.1093/bib/bbad433] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 05/10/2023] [Revised: 10/30/2023] [Accepted: 11/05/2023] [Indexed: 12/08/2023] Open
Abstract
The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing COVID-19 crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Le Thi Phan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Jimin Seo
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Yeonwoo Kim
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Minkyung Song
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Sukchan Lee
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Young-Jun Jeon
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| |
|
21
|
Kumari S, Gupta R, Ambasta RK, Kumar P. Emerging trends in post-translational modification: Shedding light on Glioblastoma multiforme. Biochim Biophys Acta Rev Cancer 2023; 1878:188999. [PMID: 37858622 DOI: 10.1016/j.bbcan.2023.188999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 10/06/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
Recent multi-omics studies, including proteomics, transcriptomics, genomics, and metabolomics have revealed the critical role of post-translational modifications (PTMs) in the progression and pathogenesis of Glioblastoma multiforme (GBM). Further, PTMs alter the oncogenic signaling events and offer a novel avenue in GBM therapeutics research through PTM enzymes as potential biomarkers for drug targeting. In addition, PTMs are critical regulators of chromatin architecture, gene expression, and tumor microenvironment (TME), that play a crucial function in tumorigenesis. Moreover, the implementation of artificial intelligence and machine learning algorithms enhances GBM therapeutics research through the identification of novel PTM enzymes and residues. Herein, we briefly explain the mechanism of protein modifications in GBM etiology, and in altering the biologics of GBM cells through chromatin remodeling, modulation of the TME, and signaling pathways. In addition, we highlighted the importance of PTM enzymes as therapeutic biomarkers and the role of artificial intelligence and machine learning in protein PTM prediction.
Affiliation(s)
- Smita Kumari
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India
| | - Rohan Gupta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; School of Medicine, University of South Carolina, Columbia, SC, United States of America
| | - Rashmi K Ambasta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; Department of Biotechnology and Microbiology, SRM University, Sonepat, Haryana, India.
| | - Pravir Kumar
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India.
| |
|
22
|
Liang Z, Liu T, Li Q, Zhang G, Zhang B, Du X, Liu J, Chen Z, Ding H, Hu G, Lin H, Zhu F, Luo C. Deciphering the functional landscape of phosphosites with deep neural network. Cell Rep 2023; 42:113048. [PMID: 37659078 DOI: 10.1016/j.celrep.2023.113048] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/19/2023] [Revised: 07/11/2023] [Accepted: 08/11/2023] [Indexed: 09/04/2023] Open
Abstract
Current biochemical approaches have only identified the most well-characterized kinases for a tiny fraction of the phosphoproteome, and the functional assignments of phosphosites are almost negligible. Herein, we analyze the substrate preference catalyzed by a specific kinase and present a novel integrated deep neural network model named FuncPhos-SEQ for functional assignment of human proteome-level phosphosites. FuncPhos-SEQ incorporates phosphosite motif information from a protein sequence using multiple convolutional neural network (CNN) channels and network features from protein-protein interactions (PPIs) using network embedding and deep neural network (DNN) channels. These concatenated features are jointly fed into a heterogeneous feature network to prioritize functional phosphosites. Combined with a series of in vitro and cellular biochemical assays, we confirm that NADK-S48/50 phosphorylation could activate its enzymatic activity. In addition, ERK1/2 are discovered as the primary kinases responsible for NADK-S48/50 phosphorylation. Moreover, FuncPhos-SEQ is developed as an online server.
Affiliation(s)
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Tonghai Liu
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Qi Li
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guangyu Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Bei Zhang
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Xikun Du
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Jingqiu Liu
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Zhifeng Chen
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Hong Ding
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Hao Lin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| | - Cheng Luo
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China; School of Life Science and Technology, ShanghaiTech University, 100 Haike Road, Shanghai 201210, China; School of Pharmacy, Fujian Medical University, Fuzhou 350122, China.
|
23
|
Pakhrin SC, Pokharel S, Pratyush P, Chaudhari M, Ismail HD, Kc DB. LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model. J Proteome Res 2023; 22:2548-2557. [PMID: 37459437 DOI: 10.1021/acs.jproteome.2c00667] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 08/05/2023]
Abstract
Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although several computational tools exist to predict phosphorylation sites, they have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for general phosphorylation site prediction that integrates embeddings from the local window sequence with the contextualized embedding obtained from the global (overall) protein sequence using a pretrained protein language model to improve prediction performance. LMPhosSite thus consists of two base models: one for capturing an effective local representation and the other for capturing a global per-residue contextualized embedding from a pretrained protein language model. The outputs of these base models are integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthews correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, respectively, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of general phosphorylation sites in proteins.
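Because the F1-score is the harmonic mean of precision and recall, the F1 values quoted in this abstract can be cross-checked against the quoted precision and recall. The following is a minimal sketch of that arithmetic only; the function name `f1_from_precision_recall` is illustrative and is not part of LMPhosSite.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, in percent.
st_f1 = f1_from_precision_recall(38.78, 67.12)  # serine/threonine test set
y_f1 = f1_from_precision_recall(34.90, 62.03)   # tyrosine test set
print(f"S/T F1 = {st_f1:.2f}%, Y F1 = {y_f1:.2f}%")
```

Both recomputed values agree with the reported 49.15% and 44.67% to within the rounding of the published precision and recall.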
Affiliation(s)
- Subash C Pakhrin
- School of Computing, Wichita State University, 1845 Fairmount St., Wichita, Kansas 67260, United States
- Department of Computer Science & Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, Texas 77002, United States
| | - Suresh Pokharel
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Pawel Pratyush
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Meenal Chaudhari
- Department of Biology, North Carolina A&T State University, Greensboro, North Carolina 27411, United States
| | - Hamid D Ismail
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Dukka B Kc
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
|
24
|
Raslan MA, Raslan SA, Shehata EM, Mahmoud AS, Sabri NA. Advances in the Applications of Bioinformatics and Chemoinformatics. Pharmaceuticals (Basel) 2023; 16:1050. [PMID: 37513961 PMCID: PMC10384252 DOI: 10.3390/ph16071050] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/26/2023] [Revised: 07/19/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023] Open
Abstract
Chemoinformatics integrates the principles of physical chemistry with computer-based and information science methodologies, commonly referred to as "in silico techniques", to address a wide range of descriptive and prescriptive chemistry problems, including applications to biology, drug discovery, and related molecular areas. Machine learning is considered highly important in drug design because it enables the extraction of chemical data from enormous compound databases to develop drugs with significant biological features. The present review discusses the field of chemoinformatics and proposes the use of virtual chemical libraries in virtual screening methods to increase the probability of discovering novel hit chemicals. Virtual libraries address the need to increase the quality of compounds as well as to discover promising ones. The review also discusses various applications of bioinformatics in disease classification, diagnosis, and identification of multidrug-resistant organisms. The use of ensemble models and a brute-force feature selection methodology has resulted in high accuracy rates for heart disease and COVID-19 diagnosis, and the role of special formulations for targeting meningitis and Alzheimer's disease is also covered. Additionally, the review presents the correlation between genomic variations and disease states such as obesity and chronic progressive external ophthalmoplegia, the investigation of the antibacterial activity of pyrazole and benzimidazole-based compounds against resistant microorganisms, and applications of chemoinformatics for the prediction of drug properties and toxicity.
Affiliation(s)
| | | | | | - Amr S Mahmoud
- Department of Obstetrics and Gynecology, Faculty of Medicine, Ain Shams University, Cairo P.O. Box 11566, Egypt
| | - Nagwa A Sabri
- Department of Clinical Pharmacy, Faculty of Pharmacy, Ain Shams University, Cairo P.O. Box 11566, Egypt
|
25
|
Manzoori S, Farahani AHK, Moradi MH, Kazemi-Bonchenari M. Detecting SNP markers discriminating horse breeds by deep learning. Sci Rep 2023; 13:11592. [PMID: 37464049 DOI: 10.1038/s41598-023-38601-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Accepted: 07/11/2023] [Indexed: 07/20/2023] Open
Abstract
The assignment of an individual to its true population of origin using a small panel of discriminant SNP markers is one of the most important practical applications of genomic data. The aim of this study was to evaluate the potential of different artificial neural network (ANN) approaches, comprising deep neural networks (DNN) and the Garson and Olden methods, for selecting informative SNP markers from high-throughput genotyping data that can trace the true breed of unknown samples. A total of 795 animals from 37 breeds, genotyped using the Illumina SNP 50k BeadChip, were used in the current study, and principal component analysis (PCA), log-likelihood ratios (LLR), and Neighbor-Joining (NJ) were applied to assess the performance of the different assignment methods. The results revealed that the DNN, Garson, and Olden methods are able to assign individuals to their true populations with 4270, 4937, and 7999 SNP markers, respectively. PCA was used to determine how the animals were allocated to groups using all genotyped markers available on the 50k BeadChip and the subsets of SNP markers identified by the different methods. The results indicated that all SNP panels are able to assign individuals to their true breeds. The success of genetic assignment, assessed at different LLR thresholds, showed that a 70% success rate was obtained with 110, 208, and 178 markers for the DNN, Garson, and Olden methods, respectively. The DNN also performed better than the other two approaches, achieving 93% accuracy at the most stringent threshold. Finally, the identified SNPs were successfully used in independent out-group breeds comprising 120 individuals from eight breeds, and the results indicated that these markers are able to correctly allocate all unknown samples to their true population of origin. Furthermore, the NJ tree of allele-sharing distances on the validation dataset showed that the DNN has a high potential for feature selection. In general, the results of this study indicated that the DNN technique represents an efficient strategy for selecting a reduced pool of highly discriminant markers for assigning individuals to their true population of origin.
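The Garson and Olden methods named in this abstract rank input features by the weights of a trained network. The sketch below illustrates one common textbook formulation of each, assuming a network with a single hidden layer and a single output; it is not the authors' code, and the function names, the toy weights, and the choice of Garson variant are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]  # rows = inputs (e.g. SNPs), columns = hidden units


def olden_importance(w_in: Matrix, w_out: List[float]) -> List[float]:
    """Olden connection weights: signed sum over hidden units of w_in[i][j] * w_out[j]."""
    return [sum(w * v for w, v in zip(row, w_out)) for row in w_in]


def garson_importance(w_in: Matrix, w_out: List[float]) -> List[float]:
    """Garson-style relative importance from absolute weights, normalized to sum to 1.

    Assumes every hidden unit has at least one non-zero incoming weight.
    """
    n_hidden = len(w_out)
    # Total absolute incoming weight of each hidden unit.
    col_abs_sum = [sum(abs(row[j]) for row in w_in) for j in range(n_hidden)]
    # Share of each hidden unit attributable to input i, scaled by |output weight|.
    raw = [
        sum(abs(row[j]) / col_abs_sum[j] * abs(w_out[j]) for j in range(n_hidden))
        for row in w_in
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Toy weights: 2 inputs, 2 hidden units (hypothetical values).
w_in = [[1.0, -2.0], [3.0, 4.0]]
w_out = [0.5, -1.0]
print(olden_importance(w_in, w_out))
print(garson_importance(w_in, w_out))
```

Olden's scores keep their sign, so they show the direction of an input's effect, whereas Garson's scores use absolute values and only give relative magnitude. Ranking SNPs by either score and keeping the top entries is one way such a method can yield a reduced marker panel.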
Affiliation(s)
- Siavash Manzoori
- Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
| | | | - Mohammad Hossein Moradi
- Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
| | - Mehdi Kazemi-Bonchenari
- Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
|
26
|
Kim Y, Lee H. PINNet: a deep neural network with pathway prior knowledge for Alzheimer's disease. Front Aging Neurosci 2023; 15:1126156. [PMID: 37520124 PMCID: PMC10380929 DOI: 10.3389/fnagi.2023.1126156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2022] [Accepted: 06/20/2023] [Indexed: 08/01/2023] Open
Abstract
Introduction: Identification of Alzheimer's Disease (AD)-related transcriptomic signatures from blood is important for early diagnosis of the disease. Deep learning techniques are potent classifiers for AD diagnosis, but most have been unable to identify biomarkers because of their lack of interpretability. Methods: To address these challenges, we propose a pathway information-based neural network (PINNet) to predict AD patients and analyze blood and brain transcriptomic signatures using an interpretable deep learning model. PINNet is a deep neural network (DNN) model with pathway prior knowledge from either the Gene Ontology or Kyoto Encyclopedia of Genes and Genomes databases. A backpropagation-based model interpretation method was then applied to reveal essential pathways and genes for predicting AD. Results: The performance of PINNet was compared with that of a DNN model without a pathway. PINNet outperformed, or performed similarly to, the DNN without a pathway on blood and brain gene expression data, respectively. Moreover, PINNet considers more AD-related genes as essential features than the DNN without a pathway in the learning process. Pathway analysis of protein-protein interaction modules of highly contributing genes showed that AD-related genes in blood were enriched in cell migration, PI3K-Akt, MAPK signaling, and apoptosis. The pathways enriched in the brain module included cell migration, PI3K-Akt, MAPK signaling, apoptosis, protein ubiquitination, and T-cell activation. Discussion: By integrating prior knowledge about pathways, PINNet can reveal essential pathways related to AD. The source codes are available at https://github.com/DMCB-GIST/PINNet.
Affiliation(s)
- Yeojin Kim
- Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
| | - Hyunju Lee
- Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
|
27
|
Zhang G, Tang Q, Feng P, Chen W. IPs-GRUAtt: An attention-based bidirectional gated recurrent unit network for predicting phosphorylation sites of SARS-CoV-2 infection. Mol Ther Nucleic Acids 2023; 32:28-35. [PMID: 36908648 PMCID: PMC9968446 DOI: 10.1016/j.omtn.2023.02.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/03/2023] [Accepted: 02/22/2023] [Indexed: 02/27/2023]
Abstract
The global pandemic of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection has generated tremendous concern and poses a serious threat to international public health. Phosphorylation is a common post-translational modification affecting many essential cellular processes and is inextricably linked to SARS-CoV-2 infection. Hence, accurate identification of phosphorylation sites will be helpful to understand the mechanisms of SARS-CoV-2 infection and mitigate the ongoing COVID-19 pandemic. In the present study, an attention-based bidirectional gated recurrent unit network, called IPs-GRUAtt, was proposed to identify phosphorylation sites in SARS-CoV-2-infected host cells. Comparative results demonstrated that IPs-GRUAtt surpassed both state-of-the-art machine-learning methods and existing models for identifying phosphorylation sites. Moreover, the attention mechanism made IPs-GRUAtt able to extract the key features from protein sequences. These results demonstrated that the IPs-GRUAtt is a powerful tool for identifying phosphorylation sites. For facilitating its academic use, a freely available online web server for IPs-GRUAtt is provided at http://cbcb.cdutcm.edu.cn/phosphory/.
Affiliation(s)
- Guiyang Zhang
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Chengdu University of Traditional Chinese Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Chengdu University of Traditional Chinese Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Pengmian Feng
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Chengdu University of Traditional Chinese Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
|
28
|
Varshney N, Mishra AK. Deep Learning in Phosphoproteomics: Methods and Application in Cancer Drug Discovery. Proteomes 2023; 11:16. [PMID: 37218921 DOI: 10.3390/proteomes11020016] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/13/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/24/2023] Open
Abstract
Protein phosphorylation is a key post-translational modification (PTM) and a central regulatory mechanism of many cellular signaling pathways. Several protein kinases and phosphatases precisely control this biochemical process. Defects in the functions of these proteins have been implicated in many diseases, including cancer. Mass spectrometry (MS)-based analysis of biological samples provides in-depth coverage of the phosphoproteome. The large amount of MS data available in public repositories has brought big data to the field of phosphoproteomics. To address the challenges associated with handling large data sets and to increase confidence in phosphorylation site prediction, the development of computational algorithms and machine learning-based approaches has gained momentum in recent years. Together, the emergence of experimental methods with high resolution and sensitivity and of data mining algorithms has provided robust analytical platforms for quantitative proteomics. In this review, we compile a comprehensive collection of bioinformatic resources used for the prediction of phosphorylation sites, and their potential therapeutic applications in the context of cancer.
Affiliation(s)
- Neha Varshney
- Division of Biological Sciences, Department of Cellular and Molecular Medicine, University of California, San Diego, CA 92093, USA
- Ludwig Institute for Cancer Research, La Jolla, CA 92093, USA
| | - Abhinava K Mishra
- Molecular, Cellular and Developmental Biology Department, University of California, Santa Barbara, CA 93106, USA
|
29
|
Wang C, Yang Q. ScerePhoSite: An interpretable method for identifying fungal phosphorylation sites in proteins using sequence-based features. Comput Biol Med 2023; 158:106798. [PMID: 36966555 DOI: 10.1016/j.compbiomed.2023.106798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 03/03/2023] [Accepted: 03/20/2023] [Indexed: 03/31/2023]
Abstract
Protein phosphorylation plays a vital role in signal transduction pathways and diverse cellular processes. To date, a tremendous number of in silico tools have been designed for phosphorylation site identification, but few of them are suitable for the identification of fungal phosphorylation sites. This largely hampers the functional investigation of fungal phosphorylation. In this paper, we present ScerePhoSite, a machine learning method for fungal phosphorylation site identification. The sequence fragments are represented by hybrid physicochemical features, and then LGB-based feature importance combined with the sequential forward search method is used to choose the optimal feature subset. As a result, ScerePhoSite surpasses currently available tools and shows more robust and balanced performance. Furthermore, the impact and contribution of specific features on model performance were investigated using SHAP values. We expect ScerePhoSite to be a useful bioinformatics tool that complements hands-on experiments for the pre-screening of possible phosphorylation sites and facilitates our functional understanding of phosphorylation modification in fungi. The source code and datasets are accessible at https://github.com/wangchao-malab/ScerePhoSite/.
|
30
|
Wang M, Yan L, Jia J, Lai J, Zhou H, Yu B. DE-MHAIPs: Identification of SARS-CoV-2 phosphorylation sites based on differential evolution multi-feature learning and multi-head attention mechanism. Comput Biol Med 2023; 160:106935. [PMID: 37120990 PMCID: PMC10140648 DOI: 10.1016/j.compbiomed.2023.106935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 03/12/2023] [Accepted: 04/13/2023] [Indexed: 05/02/2023]
Abstract
The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has disrupted daily life worldwide. Computational methods can be used to accurately identify SARS-CoV-2 phosphorylation sites. In this paper, a new prediction model of SARS-CoV-2 phosphorylation sites, called DE-MHAIPs, is proposed. First, we use six feature extraction methods to extract protein sequence information from different perspectives. For the first time, we use a differential evolution (DE) algorithm to learn individual feature weights and fuse the multiple sources of information in a weighted combination. Next, Group LASSO is used to select a subset of good features. Then, the important protein information is given higher weight through multi-head attention. After that, the processed data are fed into a long short-term memory network (LSTM) to further enhance the model's ability to learn features. Finally, the output of the LSTM is fed into a fully connected neural network (FCN) to predict SARS-CoV-2 phosphorylation sites. The AUC values of the S/T and Y datasets under 5-fold cross-validation reach 91.98% and 98.32%, respectively. The AUC values of the two datasets on the independent test set reach 91.72% and 97.78%, respectively. The experimental results show that the DE-MHAIPs method exhibits excellent predictive ability compared with other methods.
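The weighted fusion step mentioned in this abstract can be pictured as scaling each feature block by its own weight before the blocks are combined. The sketch below assumes the combination is a concatenation of scaled blocks, which the abstract does not specify; the function name, the feature names, and the fixed weights are hypothetical, and in DE-MHAIPs the weights are reported to be learned by differential evolution rather than set by hand.

```python
from typing import Dict, List


def fuse_weighted(features: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Scale each named feature block by its weight, then concatenate in sorted-key order."""
    fused: List[float] = []
    for name in sorted(features):
        fused.extend(weights[name] * x for x in features[name])
    return fused


# Hypothetical feature blocks for one peptide and hand-picked weights.
blocks = {"encoding_a": [4.0], "encoding_b": [1.0, 2.0]}
block_weights = {"encoding_a": 0.5, "encoding_b": 2.0}
print(fuse_weighted(blocks, block_weights))
```

An optimizer such as differential evolution would search over `block_weights` to maximize a validation metric of the downstream classifier.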
Affiliation(s)
- Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Lu Yan
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Jihua Jia
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Jiali Lai
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Hongyan Zhou
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.
| | - Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China.
|
31
|
Zhou H, Tan W, Shi S. DeepGpgs: a novel deep learning framework for predicting arginine methylation sites combined with Gaussian prior and gated self-attention mechanism. Brief Bioinform 2023; 24:7000314. [PMID: 36694944 DOI: 10.1093/bib/bbad018] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/09/2022] [Revised: 12/26/2022] [Accepted: 01/04/2023] [Indexed: 01/26/2023] Open
Abstract
Protein arginine methylation is an important posttranslational modification (PTM) associated with protein functional diversity and pathological conditions including cancer. Identification of methylation binding sites facilitates a better understanding of the molecular function of proteins. Recent developments in the field of deep neural networks have led to a proliferation of deep learning-based methylation identification studies because of their fast and accurate prediction. In this paper, we propose DeepGpgs, an advanced deep learning model incorporating Gaussian prior and gated attention mechanism. We introduce a residual network channel to extract the evolutionary information of proteins. Then we combine the adaptive embedding with bidirectional long short-term memory networks to form a context-shared encoder layer. A gated multi-head attention mechanism is followed to obtain the global information about the sequence. A Gaussian prior is injected into the sequence to assist in predicting PTMs. We also propose a weighted joint loss function to alleviate the false negative problem. We empirically show that DeepGpgs improves Matthews correlation coefficient by 6.3% on the arginine methylation independent test set compared with the existing state-of-the-art methylation site prediction methods. Furthermore, DeepGpgs has good robustness in phosphorylation site prediction of SARS-CoV-2, which indicates that DeepGpgs has good transferability and the potential to be extended to other modification sites prediction. The open-source code and data of the DeepGpgs can be obtained from https://github.com/saizhou1/DeepGpgs.
Affiliation(s)
- Haiwei Zhou
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Wenxi Tan
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Shaoping Shi
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
|
32
|
Zhou Z, Yeung W, Gravel N, Salcedo M, Soleymani S, Li S, Kannan N. Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions. Bioinformatics 2023; 39:7000331. [PMID: 36692152 PMCID: PMC9900213 DOI: 10.1093/bioinformatics/btad046] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 07/12/2022] [Revised: 01/16/2023] [Accepted: 01/23/2023] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION: The human genome encodes over 500 distinct protein kinases which regulate nearly all cellular processes by the specific phosphorylation of protein substrates. While advances in mass spectrometry and proteomics studies have identified thousands of phosphorylation sites across species, information on the specific kinases that phosphorylate these sites is currently lacking for the vast majority of phosphosites. Recently, there has been a major focus on the development of computational models for predicting kinase-substrate associations. However, most current models only allow predictions on a subset of well-studied kinases. Furthermore, the utilization of hand-curated features and imbalances in training and testing datasets pose unique challenges in the development of accurate predictive models for kinase-specific phosphorylation prediction. Motivated by the recent development of universal protein language models which automatically generate context-aware features from primary sequence information, we sought to develop a unified framework for kinase-specific phosphosite prediction, allowing for greater investigative utility and enabling substrate predictions at the whole kinome level. RESULTS: We present a deep learning model for kinase-specific phosphosite prediction, termed Phosformer, which predicts the probability of phosphorylation given an arbitrary pair of unaligned kinase and substrate peptide sequences. We demonstrate that Phosformer implicitly learns evolutionary and functional features during training, removing the need for feature curation and engineering. Further analyses reveal that Phosformer also learns substrate specificity motifs and is able to distinguish between functionally distinct kinase families. Benchmarks indicate that Phosformer exhibits significant improvements compared to the state-of-the-art models, while also presenting a more generalized, unified, and interpretable predictive framework.
AVAILABILITY AND IMPLEMENTATION: Code and data are available at https://github.com/esbgkannan/phosformer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, GA 30602, USA
| | - Mariah Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, GA 30602, USA
| | | | - Sheng Li
- To whom correspondence should be addressed.
|
33
|
Xiao D, Chen C, Yang P. Computational systems approach towards phosphoproteomics and their downstream regulation. Proteomics 2023; 23:e2200068. [PMID: 35580145 DOI: 10.1002/pmic.202200068] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/17/2022] [Revised: 04/26/2022] [Accepted: 05/03/2022] [Indexed: 11/07/2022]
Abstract
Protein phosphorylation plays an essential role in modulating cell signalling and its downstream transcriptional and translational regulations. Until recently, protein phosphorylation has been studied mostly using low-throughput biochemical assays. The advancement of mass spectrometry (MS)-based phosphoproteomics transformed the field by enabling measurement of proteome-wide phosphorylation events, where tens of thousands of phosphosites are routinely identified and quantified in an experiment. This has brought a significant challenge in analysing large-scale phosphoproteomic data, making computational methods and systems approaches integral parts of phosphoproteomics. Previous works have primarily focused on reviewing the experimental techniques in MS-based phosphoproteomics, yet a systematic survey of the computational landscape in this field is still missing. Here, we review computational methods and tools, and systems approaches that have been developed for phosphoproteomics data analysis. We categorise them into four aspects including data processing, functional analysis, phosphoproteome annotation and their integration with other omics, and in each aspect, we discuss the key methods and example studies. Lastly, we highlight some of the potential research directions on which future work would make a significant contribution to this fast-growing field. We hope this review provides a useful snapshot of the field of computational systems phosphoproteomics and stimulates new research that drives future development.
Affiliation(s)
- Di Xiao
- Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
| | - Carissa Chen
- Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
| | - Pengyi Yang
- Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia
|
34
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. Cell Rep Methods 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
|
35
|
Ullah M, Hadi F, Song J, Yu DJ. PScL-2LSAESM: bioimage-based prediction of protein subcellular localization by integrating heterogeneous features with the two-level SAE-SM and mean ensemble method. Bioinformatics 2023; 39:6839969. [PMID: 36413068 PMCID: PMC9947927 DOI: 10.1093/bioinformatics/btac727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 11/02/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022] Open
Abstract
MOTIVATION: Over the past decades, a variety of in silico methods have been developed to predict protein subcellular localization within cells. However, a common and major challenge in the design and development of such methods is how to effectively utilize the heterogeneous feature sets extracted from bioimages. In this regard, limited efforts have been undertaken. RESULTS: We propose a new two-level stacked autoencoder network (termed 2L-SAE-SM) to improve prediction performance by integrating the heterogeneous feature sets. In particular, in the first level of 2L-SAE-SM, each optimal heterogeneous feature set is fed to train our designed stacked autoencoder network (SAE-SM). All the trained SAE-SMs in the first level can output the decision sets based on their respective optimal heterogeneous feature sets, known as 'intermediate decision' sets. Such intermediate decision sets are then ensembled using the mean ensemble method to generate the 'intermediate feature' set for the second-level SAE-SM. Using the proposed framework, we further develop a novel predictor, referred to as PScL-2LSAESM, to characterize image-based protein subcellular localization. Extensive benchmarking experiments on the latest benchmark training and independent test datasets collected from the human protein atlas databank demonstrate the effectiveness of the proposed 2L-SAE-SM framework for the integration of heterogeneous feature sets. Moreover, performance comparison of the proposed PScL-2LSAESM with current state-of-the-art methods further illustrates that PScL-2LSAESM clearly outperforms the existing state-of-the-art methods for the task of protein subcellular localization. AVAILABILITY AND IMPLEMENTATION: https://github.com/csbio-njust-edu/PScL-2LSAESM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
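The mean ensemble step mentioned in this abstract can be pictured with a minimal sketch. It assumes that each first-level model emits one score per class for every sample; the function name `mean_ensemble` and the toy scores are hypothetical and are not taken from the PScL-2LSAESM code.

```python
from statistics import fmean


def mean_ensemble(decision_sets):
    """Average per-class scores across models, sample by sample.

    decision_sets[m][i][c] is the score that model m assigns to class c
    for sample i (an assumed layout, chosen for illustration only).
    """
    if not decision_sets:
        raise ValueError("need at least one decision set")
    n_samples = len(decision_sets[0])
    ensembled = []
    for i in range(n_samples):
        # Collect the score vectors that every model produced for sample i.
        per_model = [model[i] for model in decision_sets]
        # zip(*per_model) groups the scores class by class before averaging.
        ensembled.append([fmean(scores) for scores in zip(*per_model)])
    return ensembled


# Two hypothetical first-level models, two samples, two classes.
model_a = [[0.8, 0.2], [0.4, 0.6]]
model_b = [[0.6, 0.4], [0.2, 0.8]]
intermediate_features = mean_ensemble([model_a, model_b])
```

In this reading, the averaged vectors would play the role of the 'intermediate feature' set passed to the second level; how the published method arranges them in detail is not stated in the abstract.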
Affiliation(s)
- Matee Ullah
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Fazal Hadi
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | | | - Dong-Jun Yu
- To whom correspondence should be addressed.
|
36
|
Ahmed F, Dehzangi I, Hasan MM, Shatabda S. Accurately predicting microbial phosphorylation sites using evolutionary and structural features. Gene 2023; 851:146993. [DOI: 10.1016/j.gene.2022.146993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 10/05/2022] [Accepted: 10/14/2022] [Indexed: 11/27/2022]
|
37
|
Jiang H, Shang S, Sha Y, Zhang L, He N, Li L. EdeepSADPr: an extensive deep-learning architecture for prediction of the in situ crosstalks of serine phosphorylation and ADP-ribosylation. Front Cell Dev Biol 2023; 11:1149535. [PMID: 37187615 PMCID: PMC10175571 DOI: 10.3389/fcell.2023.1149535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2023] [Accepted: 04/17/2023] [Indexed: 05/17/2023] Open
Abstract
The in situ post-translational modification (PTM) crosstalk refers to the interactions between different types of PTMs that occur on the same residue site of a protein. The crosstalk sites generally have different characteristics from those with the single PTM type. Studies targeting the latter's features have been widely conducted, while studies on the former's characteristics are rare. For example, the characteristics of serine phosphorylation (pS) and serine ADP-ribosylation (SADPr) have been investigated, whereas those of their in situ crosstalks (pSADPr) are unknown. In this study, we collected 3,250 human pSADPr, 7,520 SADPr, 151,227 pS and 80,096 unmodified serine sites and explored the features of the pSADPr sites. We found that the characteristics of pSADPr sites are more similar to those of SADPr compared to pS or unmodified serine sites. Moreover, the crosstalk sites are likely to be phosphorylated by some kinase families (e.g., AGC, CAMK, STE and TKL) rather than others (e.g., CK1 and CMGC). Additionally, we constructed three classifiers to predict pSADPr sites from the pS dataset, the SADPr dataset and the protein sequences separately. We built and evaluated five deep-learning classifiers in ten-fold cross-validation and independent test datasets. We also used the classifiers as base classifiers to develop a few stacking-based ensemble classifiers to improve performance. The best classifiers had the AUC values of 0.700, 0.914 and 0.954 for recognizing pSADPr sites from the SADPr, pS and unmodified serine sites, respectively. The lowest prediction accuracy was achieved by separating pSADPr and SADPr sites, which is consistent with the observation that pSADPr's characteristics are more similar to those of SADPr than the rest. Finally, we developed an online tool for extensively predicting human pSADPr sites based on the CNNOH classifier, dubbed EdeepSADPr. It is freely available through http://edeepsadpr.bioinfogo.org/. 
We expect our investigation will promote a comprehensive understanding of crosstalks.
Affiliation(s)
- Haoqiang Jiang
- College of Basic Medicine, Qingdao University, Qingdao, China
- Sino Genomics Technology Co., Ltd., Qingdao, China
| | - Shipeng Shang
- College of Basic Medicine, Qingdao University, Qingdao, China
| | - Yutong Sha
- College of Basic Medicine, Qingdao University, Qingdao, China
| | - Lin Zhang
- College of Computer Science and Technology, Qingdao University, Qingdao, China
| | - Ningning He
- College of Basic Medicine, Qingdao University, Qingdao, China
| | - Lei Li
- College of Basic Medicine, Qingdao University, Qingdao, China
- Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao, China
- *Correspondence: Lei Li
|
38
|
A Novel Capsule Network with Attention Routing to Identify Prokaryote Phosphorylation Sites. Biomolecules 2022; 12:biom12121854. [PMID: 36551282 PMCID: PMC9775645 DOI: 10.3390/biom12121854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Revised: 12/07/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022] Open
Abstract
By denaturing proteins and promoting the formation of multiprotein complexes, protein phosphorylation has important effects on the activity of protein functional molecules and cell signaling. The regulation of protein phosphorylation allows microbes to respond rapidly and reversibly to specific environmental stimuli or niches, which is closely related to the molecular mechanisms of bacterial drug resistance. Accurate prediction of phosphorylation sites (p-site) of prokaryotes can contribute to addressing bacterial resistance and providing new perspectives for developing novel antibacterial drugs. Most existing studies focus on human phosphorylation sites, while tools targeting phosphorylation site identification of prokaryotic proteins are still relatively scarce. This study designs a capsule network-based prediction technique for p-site in prokaryotes. To address the poor scalability and unreliability of dynamic routing processes in the output space of capsule networks, a more reliable way is introduced to learn the consistency between capsules. We incorporate a self-attention mechanism into the routing algorithm to capture the global information of the capsule, reducing the computational effort while enriching the representation capability of the capsule. Aiming at the weak robustness of the model, EcapsP improves the prediction accuracy and stability by introducing shortcuts and unconditional reconfiguration. In addition, the study compares and analyzes the prediction performance based on word vectors, physicochemical properties, and mixing characteristics in predicting serine (Ser/S), threonine (Thr/T), and tyrosine (Tyr/Y) p-site. The comprehensive experimental results show that the accuracy of the developed technique is close to 70% for the identification of the three phosphorylation sites in prokaryotes. 
Importantly, in side-by-side comparisons with other state-of-the-art predictors, our method improves the Matthews correlation coefficient (MCC) by approximately 7%. The results demonstrate the superiority of EcapsP in terms of high performance and reliability.
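For readers unfamiliar with the metric, the Matthews correlation coefficient cited above is computed from the four confusion-matrix counts. The following is a minimal sketch of the standard formula; the function name and the example counts are hypothetical and do not come from the EcapsP study.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """Return the MCC for binary confusion-matrix counts.

    The value ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction). By a common convention, 0.0 is returned
    when any marginal sum is zero and the ratio is undefined.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a phosphorylation-site classifier.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it a common choice for the imbalanced positive and negative sets typical of site prediction.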
|
39
|
Zeng Y, Liu D, Wang Y. Identification of phosphorylation site using S-padding strategy based convolutional neural network. Health Inf Sci Syst 2022; 10:29. [PMID: 36124094 PMCID: PMC9481819 DOI: 10.1007/s13755-022-00196-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/08/2022] [Accepted: 08/25/2022] [Indexed: 10/14/2022] Open
Abstract
Purpose: Abnormal phosphorylation has been shown to be associated with a variety of human diseases, and the identification of phosphorylation sites is one of the research hotspots in healthcare. Deep learning studies of phosphorylation site prediction often introduce a variety of information, and the use of complex models limits the scenarios in which the models can be applied. Methods: An enhanced deep learning method with an S-padding strategy based on a convolutional neural network is proposed in this paper. The S-padding strategy forms a three-dimensional matrix with extension information from the original amino acid sequences, and a corresponding 2D-CNN model is designed to abstract the comprehensive features of the phosphorylation site area in protein sequences. Results: Fivefold cross-validation experiments were conducted, and the results show that the proposed method achieves an accuracy of 89.68% on serine/threonine sites and 88.16% on tyrosine sites on the human dataset. Furthermore, phosphorylation site prediction on different organisms obtains accuracy, sensitivity, and specificity of over 0.85, indicating a potential capability on the phosphorylation site prediction task. Comparison with existing models shows that the proposed method obtains better performance on both accuracy and AUC value, and that the proposed method can further improve performance with sufficient training data. Conclusion: This method enables proteome-wide predictions via models trained on a large amount of phosphorylation data, further exploiting the potential of protein phosphorylation site identification, and helping to provide insights into phosphorylation mechanisms.
Affiliation(s)
- Yanjiao Zeng
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Dongning Liu
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Yang Wang
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
|
40
|
Zhao J, Zhuang M, Liu J, Zhang M, Zeng C, Jiang B, Wu J, Song X. pHisPred: a tool for the identification of histidine phosphorylation sites by integrating amino acid patterns and properties. BMC Bioinformatics 2022; 23:399. [PMID: 36171552 PMCID: PMC9520798 DOI: 10.1186/s12859-022-04938-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022] Open
Abstract
Background: Protein histidine phosphorylation (pHis) plays critical roles in prokaryotic signal transduction pathways and various eukaryotic cellular processes. It is estimated to account for 6–10% of the phosphoproteome; however, only hundreds of pHis sites have been discovered to date. Due to the inherent disadvantages of experimental methods, developing efficient computational approaches to identify pHis sites is an urgent task. Results: Here, we present a novel tool, pHisPred, for accurately identifying pHis sites from protein sequences. We manually collected the largest number of experimentally validated pHis sites to build benchmark datasets. Using randomized tenfold CV, the weighted SVM-RBF model outperforms the other four commonly used classification models (LR, KNN, RF, and MLP). From tens of thousands of features, the 140 and 150 most informative features were selected for the eukaryotic and prokaryotic models, respectively. The average AUC and F1-score values of pHisPred were (0.81, 0.40) and (0.78, 0.46) for tenfold CV on the eukaryotic and prokaryotic training datasets, respectively. In addition, pHisPred significantly outperforms other tools on testing datasets, in particular on the eukaryotic one. Conclusion: We implemented a Python program of pHisPred, which is freely available for non-commercial use at https://github.com/xiaofengsong/pHisPred. Moreover, users can use it to train new models with their own data. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04938-x.
Affiliation(s)
- Jian Zhao
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Minhui Zhuang
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Jingjing Liu
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Meng Zhang
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Cong Zeng
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Bin Jiang
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
| | - Jing Wu
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China.
| | - Xiaofeng Song
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China.
|
41
|
Ullah M, Hadi F, Song J, Yu DJ. PScL-DDCFPred: an ensemble deep learning-based approach for characterizing multiclass subcellular localization of human proteins from bioimage data. Bioinformatics 2022; 38:4019-4026. [PMID: 35771606 PMCID: PMC9890309 DOI: 10.1093/bioinformatics/btac432] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/21/2022] [Revised: 06/03/2022] [Accepted: 06/28/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION: Characterization of protein subcellular localization has become an important and long-standing task in bioinformatics and computational biology, which provides valuable information for elucidating various cellular functions of proteins and guiding drug design. RESULTS: Here, we develop a novel bioimage-based computational approach, termed PScL-DDCFPred, to accurately predict protein subcellular localizations in human tissues. PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features; next, it applies a new integrative feature selection method based on stepwise discriminant analysis and generalized discriminant analysis to identify the optimal feature sets from the extracted pure features; finally, a classifier based on deep neural network (DNN) and deep-cascade forest (DCF) is established. Stringent 10-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the human protein atlas databank, illustrate that PScL-DDCFPred achieves a better performance than several existing state-of-the-art methods. Moreover, the independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, complementarity of global and local features, and use of the optimal feature sets selected by the integrative feature selection algorithm. AVAILABILITY AND IMPLEMENTATION: https://github.com/csbio-njust-edu/PScL-DDCFPred. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matee Ullah
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Fazal Hadi
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
|
42
|
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022] Open
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases, playing a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and diseases therapy. Compared to traditional machine learning methods, the deep learning approaches for PTM prediction provide accurate and rapid screening, guiding the downstream wet experiments to leverage the screened information for focused studies. In this paper, we reviewed the recent works in deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarized PTM databases and discussed future directions with critical insights.
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M.musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S.cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning
|
43
|
Zhu F, Yang S, Meng F, Zheng Y, Ku X, Luo C, Hu G, Liang Z. Leveraging Protein Dynamics to Identify Functional Phosphorylation Sites using Deep Learning Models. J Chem Inf Model 2022; 62:3331-3345. [PMID: 35816597 DOI: 10.1021/acs.jcim.2c00484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
Abstract
Accurate prediction of post-translational modifications (PTMs) is of great significance in understanding cellular processes, because PTMs modulate protein structure and dynamics. Nowadays, with the rapid growth of protein data at different "omics" levels, machine learning models have greatly enriched the prediction of PTMs. However, most machine learning models rely only on protein sequence and little structural information. The lack of systematic analysis of the dynamics underlying PTMs largely limits PTM functional predictions. In this research, we present two dynamics-centric deep learning models, namely, cDL-PAU and cDL-FuncPhos, which incorporate sequence, structure, and dynamics-based features to elucidate the molecular basis and underlying functional landscape of PTMs. cDL-PAU achieved satisfactory area under the curve (AUC) scores of 0.804-0.888 for predicting phosphorylation, acetylation, and ubiquitination (PAU) sites, while cDL-FuncPhos achieved an AUC value of 0.771 for predicting functional phosphorylation (FuncPhos) sites, displaying reliable improvements. Feature selection shows that the dynamics-based coupling and commute ability contribute strongly to discovering PAU sites and FuncPhos sites, suggesting an allosteric propensity for important PTMs. The application of cDL-FuncPhos to three oncoproteins not only corroborates its strong performance in FuncPhos prioritization but also provides insight into the physical basis for the functions. The source code and data set of cDL-PAU and cDL-FuncPhos are available at https://github.com/ComputeSuda/PTM_ML.
Affiliation(s)
- Fei Zhu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China.,School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Fanwang Meng
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton L8S 4L8, Ontario, Canada
| | - Yuxiang Zheng
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Xin Ku
- Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Cheng Luo
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China.,Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China.,State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
|
44
|
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022] Open
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Atocha Aliseda
  - Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Fernando Pérez-Escamirosa
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Francine Ochoa-Fernández
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Ricardo Zamora-Solís
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Sebastián Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Cristina Revilla-Monsalve
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Nicolás Kemper-Valverde
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Myriam M. Altamirano-Bustamante
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
|
45
|
DeepDA-Ace: A Novel Domain Adaptation Method for Species-Specific Acetylation Site Prediction. MATHEMATICS 2022. [DOI: 10.3390/math10142364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2023]
Abstract
Protein lysine acetylation is an important type of post-translational modification (PTM), and it plays a crucial role in various cellular processes. Although many researchers have recently developed computational tools for acetylation site prediction, most of these tools rely on traditional machine learning algorithms, lack species specificity, and are maintained as a single prediction model. Recent studies have shown that the acetylation sites of distinct species have evident location-specific differences; however, there is currently no integrated prediction model that can effectively predict acetylation sites across all species. Therefore, to improve prediction at the species-specific level, it is necessary to establish a framework for species-specific acetylation site prediction. In this work, we propose a domain adaptation framework, DeepDA-Ace, for species-specific acetylation site prediction, including Rattus norvegicus, Schistosoma japonicum, Arabidopsis thaliana, and other types of species. In DeepDA-Ace, an attention-based densely connected convolutional neural network is designed to capture sequence features, and a semantic adversarial learning strategy is proposed to align the features of different species so as to achieve knowledge transfer. DeepDA-Ace outperformed both the general prediction model and the fine-tuning-based species-specific model across most types of species. The experimental results demonstrate that DeepDA-Ace is superior to the general and fine-tuning methods, and its precision exceeds 0.75 on most species. In addition, our method achieves at least a 5% improvement over the existing acetylation prediction tools.
|
46
|
Ma R, Li S, Li W, Yao L, Huang HD, Lee TY. KinasePhos 3.0: Redesign and Expansion of the Prediction on Kinase-specific Phosphorylation Sites. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022:S1672-0229(22)00081-X. [PMID: 35781048 PMCID: PMC10373160 DOI: 10.1016/j.gpb.2022.06.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/21/2021] [Revised: 05/30/2022] [Accepted: 06/27/2022] [Indexed: 06/04/2023]
Abstract
The purpose of this work is to enhance KinasePhos, a machine learning-based kinase-specific phosphorylation site prediction tool. Experimentally verified kinase-specific phosphorylation data were collected from PhosphoSitePlus, UniProtKB, the Group-based Prediction System 5.0, and Phospho.ELM. In total, 41,421 experimentally verified kinase-specific phosphorylation sites were identified. A total of 1380 unique kinases were identified, including 753 with existing classification information from KinBase and the remaining 627 annotated by building a phylogenetic tree. Based on this kinase classification, a total of 771 predictive models were built at the individual, family, and group levels, using at least 15 experimentally verified substrate sites in positive training datasets. The improved models demonstrated their effectiveness compared with other prediction tools. For example, the prediction of sites phosphorylated by the protein kinase B, casein kinase 2, and protein kinase A families had accuracies of 94.5%, 92.5%, and 90.0%, respectively. The average prediction accuracy for all 771 models was 87.2%. For enhancing interpretability, the SHapley Additive exPlanations (SHAP) method was employed to assess feature importance. The web interface of KinasePhos 3.0 has been redesigned to provide comprehensive annotations of kinase-specific phosphorylation sites on multiple proteins. Additionally, considering the large scale of phosphoproteomic data, a downloadable prediction tool is available at https://awi.cuhk.edu.cn/KinasePhos/download.html or https://github.com/tom-209/KinasePhos-3.0-executable-file.
Affiliation(s)
- Renfei Ma
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life Sciences, University of Science and Technology of China, Hefei 230027, China
- Shangfu Li
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Wenshuo Li
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
- Lantian Yao
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
- Hsien-Da Huang
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life and Health Sciences, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China.
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life and Health Sciences, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China.
|
47
|
Deep Learning-Based Advances In Protein Posttranslational Modification Site and Protein Cleavage Prediction. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2499:285-322. [PMID: 35696087 DOI: 10.1007/978-1-0716-2317-6_15] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/18/2022]
Abstract
Posttranslational modification (PTM) is a ubiquitous phenomenon in both eukaryotes and prokaryotes which gives rise to enormous proteomic diversity. PTM mostly comes in two flavors: covalent modification of the polypeptide chain and proteolytic cleavage. Understanding and characterizing PTM is a fundamental step toward understanding the underpinnings of biology. Recent advances in experimental approaches, mainly mass-spectrometry-based approaches, have immensely helped in obtaining and characterizing PTMs. However, experimental approaches are not enough to understand and characterize more than 450 different types of PTMs, and complementary computational approaches are becoming popular. Recently, due to the various advancements in the field of deep learning (DL), along with the explosion of applications of DL to various fields, the field of computational prediction of PTM has also witnessed the development of a plethora of DL-based approaches. In this book chapter, we first review some recent DL-based approaches in the field of PTM site prediction. In addition, we also review the recent advances in the not-so-studied PTM, that is, proteolytic cleavage predictions. We describe advances in PTM prediction by highlighting the deep learning architecture, feature encoding, novelty of the approaches, and availability of the tools/approaches. Finally, we provide an outlook and possible future research directions for DL-based approaches for PTM prediction.
|
48
|
Wang B, Wang M, Zhang H, Xu J, Hou J, Zhu Y. Canine Adenovirus 1 Isolation Bioinformatics Analysis of the Fiber. Front Cell Infect Microbiol 2022; 12:879360. [PMID: 35770071 PMCID: PMC9235841 DOI: 10.3389/fcimb.2022.879360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2022] [Accepted: 04/25/2022] [Indexed: 11/13/2022] Open
Abstract
Canine adenovirus type 1 (CAdV-1) is a double-stranded DNA virus, which is the causative agent of fox encephalitis. The Fiber protein is one of the structural proteins in CAdV-1, which mediates virion binding to the coxsackievirus and adenovirus receptor on host cells. The suspected virus was cultured in MDCK cells and identified through cytopathic effects, sequencing, and electron microscopy. The informatics analysis of the Fiber was done using online bioinformatics servers. The CAdV-1-JL2021 strain was isolated successfully and was most similar to the CAdV-1 strain circulating in Italy. Negative selection and recombination were found in CAdV-1-JL2021 and CAdV-2-AC_000020.1. The host cell membrane was its subcellular localization. The CAdV-1-JL2021 Fiber (ON164651) had 6 glycosylation sites and 107 phosphorylation sites and exerted adhesion receptor-mediated virion attachment to the host cell, the same as the CAdV-2-AC_000020.1 Fiber. The Fiber tertiary structures of CAdV-1-JL2021 and CAdV-2-AC_000020.1 were different, but they had the same coxsackievirus and adenovirus receptor. “VATTSPTLTFAYPLIKNNNH” was predicted to be the potential CAdV-1 B cell linear epitope. The MHC-I binding peptide “KLGVKPTTY” was present in both the CAdV-1-JL2021 and CAdV-2-AC_000020.1 Fibers and is useful for designing a canine adenovirus vaccine.
Affiliation(s)
- Ben Wang
  - Animal Science and Technology College, Jilin Agriculture Science and Technology University, Jilin, China
- Minchun Wang
  - Institute of Special Animal and Plant Sciences, Chinese Academy of Agricultural Sciences, Changchun, China
- Hongling Zhang
  - Animal Science and Technology College, Jilin Agriculture Science and Technology University, Jilin, China
- Jinfeng Xu
  - Institute of Special Animal and Plant Sciences, Chinese Academy of Agricultural Sciences, Changchun, China
- Jinyu Hou
  - Institute of Special Animal and Plant Sciences, Chinese Academy of Agricultural Sciences, Changchun, China
  - College of Veterinary Medicine, Jilin Agricultural University, Changchun, China
- Yanzhu Zhu
  - Institute of Special Animal and Plant Sciences, Chinese Academy of Agricultural Sciences, Changchun, China
  - College of Veterinary Medicine, Jilin Agricultural University, Changchun, China
  - *Correspondence: Yanzhu Zhu,
|
49
|
A Novel n-Gram-Based Image Classification Model and Its Applications in Diagnosing Thyroid Nodule and Retinal OCT Images. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:3151554. [PMID: 35547561 PMCID: PMC9085325 DOI: 10.1155/2022/3151554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2021] [Revised: 04/14/2022] [Accepted: 04/16/2022] [Indexed: 11/18/2022]
Abstract
Imbalanced classes and dimensional disasters are critical challenges in medical image classification. As a classical machine learning model, the n-gram model has shown excellent performance in addressing these issues in text classification. In this study, we proposed an algorithm to classify medical images by extracting their n-gram semantic features. This algorithm first converts an image classification problem to a text classification problem by building an n-gram corpus for an image. The algorithm then classifies the images using the n-gram model. The algorithm was evaluated on two independent public datasets. The first experiment is to diagnose benign and malignant thyroid nodules. The best area under the curve (AUC) is 0.989. The second experiment is to diagnose the type of fundus lesion. The best result is that it correctly identified 86.667% of patients with dry age-related macular degeneration (AMD), 93.333% of patients with diabetic macular edema (DME), and 93.333% of normal individuals.
|
50
|
Prediction of GPCR activity using Machine Learning. Comput Struct Biotechnol J 2022; 20:2564-2573. [PMID: 35685352 PMCID: PMC9163700 DOI: 10.1016/j.csbj.2022.05.016] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/10/2022] [Revised: 05/08/2022] [Accepted: 05/09/2022] [Indexed: 11/20/2022] Open
Abstract
GPCRs are the target for one-third of the FDA-approved drugs; however, the development of new drug molecules targeting GPCRs is limited by the lack of mechanistic understanding of the GPCR structure–activity-function relationship. To modulate GPCR activity with highly specific drugs and minimal side effects, it is necessary to quantitatively describe the important structural features in the GPCR and correlate them to the activation state of the GPCR. In this study, we developed 3 ML approaches to predict the conformation state of GPCR proteins. Additionally, we predict the activity level of GPCRs based on their structure. We leverage the unique advantages of each of the 3 ML approaches: the interpretability of XGBoost, minimal feature engineering for the 3D convolutional neural network, and the graph representation of protein structure for the graph neural network. By using these ML approaches, we are able to predict the activation state of GPCRs with high accuracy (91%–95%) and also predict the activity level of GPCRs with low error (MAE of 7.15–10.58). Furthermore, the interpretation of the ML approaches allows us to determine the importance of each of the features in distinguishing between the GPCR conformations.
|