1
Feng J, Sun M, Liu C, Zhang W, Xu C, Wang J, Wang G, Wan S. SAMP: Identifying antimicrobial peptides by an ensemble learning model based on proportionalized split amino acid composition. Brief Funct Genomics 2024; 23:879-890. [PMID: 39573886 PMCID: PMC11631067 DOI: 10.1093/bfgp/elae046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Revised: 08/23/2024] [Accepted: 11/01/2024] [Indexed: 11/24/2024] Open
Abstract
It is projected that 10 million deaths could be attributed to drug-resistant bacterial infections in 2050. Identifying new-generation antibiotics is an effective way to address this concern. Antimicrobial peptides (AMPs), a class of innate immune effectors, have received significant attention for their capacity to eliminate drug-resistant pathogens, including viruses, bacteria, and fungi. Recent years have witnessed widespread application of computational methods, especially machine learning (ML) and deep learning (DL), for discovering AMPs. However, existing methods only use features such as compositional, physicochemical, and structural properties of peptides, which cannot fully capture sequence information from AMPs. Here, we present SAMP, an ensemble random projection (RP) based computational model that leverages a new type of feature called proportionalized split amino acid composition (PSAAC) in addition to conventional sequence-based features for AMP prediction. With this new feature set, SAMP captures residue patterns, such as sorting signals, at both the N-terminus and the C-terminus, while also retaining the sequence order information from the middle peptide fragments. Benchmarking tests on different balanced and imbalanced datasets demonstrate that SAMP consistently outperforms existing state-of-the-art methods, such as iAMPpred and AMPScanner V2, in terms of accuracy, Matthews correlation coefficient (MCC), G-measure, and F1-score. In addition, by leveraging an ensemble RP architecture, SAMP scales to large-scale AMP identification and further improves performance compared with models without RP. To facilitate the use of SAMP, we have developed a Python package that is freely available at https://github.com/wan-mlab/SAMP.
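The abstract does not give the exact definition of PSAAC, so the following is only a minimal sketch of the general idea of a split amino acid composition: the peptide is cut into an N-terminal, a middle, and a C-terminal segment whose lengths are proportional to the sequence length, and the amino acid composition of each segment is concatenated. The 25%/50%/25% proportions and the names `aac` and `split_aac` are assumptions made for illustration and are not taken from the SAMP package.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(segment):
    """Amino acid composition: frequency of each standard residue in a segment."""
    counts = Counter(segment)
    n = len(segment)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def split_aac(seq, n_frac=0.25, c_frac=0.25):
    """Concatenate the compositions of the N-terminal, middle, and C-terminal parts.

    n_frac and c_frac are hypothetical proportions, not values from the paper.
    """
    n_len = max(1, round(len(seq) * n_frac))
    c_len = max(1, round(len(seq) * c_frac))
    n_term = seq[:n_len]
    c_term = seq[len(seq) - c_len:]
    middle = seq[n_len:len(seq) - c_len]
    return aac(n_term) + aac(middle) + aac(c_term)


if __name__ == "__main__":
    features = split_aac("ACDEFGHI")
    print(len(features))  # 3 segments x 20 residues
```

Because the segment lengths scale with the peptide length, short and long peptides yield feature vectors of the same size (60 values in this sketch), which is what a downstream classifier needs.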
Affiliation(s)
- Junxi Feng
  - Department of Biostatistics, School of Public Health, Harvard University, Boston, MA 02115, United States
- Mengtao Sun
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States
- Cong Liu
  - Department of Mathematics, Data Science, University of Waterloo, Waterloo, ON N2L3G1, Canada
- Weiwei Zhang
  - Department of Pathology, Microbiology, and Immunology, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States
- Changmou Xu
  - Department of Food Science and Human Nutrition, College of Agricultural, Consumer and Environmental Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States
- Guangshun Wang
  - Department of Pathology, Microbiology, and Immunology, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States
2
Li L, Sun M, Wang J, Wan S. Multi-omics based artificial intelligence for cancer research. Adv Cancer Res 2024; 163:303-356. [PMID: 39271266 DOI: 10.1016/bs.acr.2024.06.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2024]
Abstract
With significant advances in next-generation sequencing technologies, large amounts of multi-omics data, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics, have been accumulated, offering an unprecedented opportunity to explore the heterogeneity and complexity of cancer across various molecular levels and scales. One of the promising aspects of multi-omics lies in its capacity to offer a holistic view of the biological networks and pathways underpinning cancer, facilitating a deeper understanding of its development, progression, and response to treatment. However, the exponential growth of data generated by multi-omics studies presents significant analytical challenges. Processing, analyzing, integrating, and interpreting these multi-omics datasets to extract meaningful insights is an ambitious task that stands at the forefront of current cancer research. The application of artificial intelligence (AI) has emerged as a powerful solution to these challenges, demonstrating exceptional capabilities in deciphering complex patterns and extracting valuable information from large-scale, intricate omics datasets. This review delves into the synergy of AI and multi-omics, highlighting its revolutionary impact on oncology. We dissect how this confluence is reshaping the landscape of cancer research and clinical practice, particularly in the realms of early detection, diagnosis, prognosis, treatment, and pathology. Additionally, we elaborate on the latest AI methods for multi-omics integration to provide comprehensive insight into the complex biological mechanisms and inherent heterogeneity of cancer. Finally, we discuss the current challenges of data harmonization, algorithm interpretability, and ethical considerations. Addressing these challenges requires multidisciplinary collaboration, paving the way for more precise, personalized, and effective treatments for cancer patients.
Affiliation(s)
- Lusheng Li
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States
- Mengtao Sun
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States
- Jieqiong Wang
  - Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE, United States
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States
3
Feng J, Sun M, Liu C, Zhang W, Xu C, Wang J, Wang G, Wan S. SAMP: Identifying Antimicrobial Peptides by an Ensemble Learning Model Based on Proportionalized Split Amino Acid Composition. bioRxiv 2024:2024.04.25.590553. [PMID: 38712184 PMCID: PMC11071531 DOI: 10.1101/2024.04.25.590553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
It is projected that 10 million deaths could be attributed to drug-resistant bacterial infections in 2050. Identifying new-generation antibiotics is an effective way to address this concern. Antimicrobial peptides (AMPs), a class of innate immune effectors, have received significant attention for their capacity to eliminate drug-resistant pathogens, including viruses, bacteria, and fungi. Recent years have witnessed widespread application of computational methods, especially machine learning (ML) and deep learning (DL), for discovering AMPs. However, existing methods only use features such as compositional, physicochemical, and structural properties of peptides, which cannot fully capture sequence information from AMPs. Here, we present SAMP, an ensemble random projection (RP) based computational model that leverages a new type of feature called Proportionalized Split Amino Acid Composition (PSAAC) in addition to conventional sequence-based features for AMP prediction. With this new feature set, SAMP captures residue patterns, such as sorting signals, around both the N-terminus and the C-terminus, while also retaining the sequence order information from the middle peptide fragments. Benchmarking tests on different balanced and imbalanced datasets demonstrate that SAMP consistently outperforms existing state-of-the-art methods, such as iAMPpred and AMPScanner V2, in terms of accuracy, MCC, G-measure, and F1-score. In addition, by leveraging an ensemble RP architecture, SAMP scales to large-scale AMP identification and further improves performance compared with models without RP. To facilitate the use of SAMP, we have developed a Python package freely available at https://github.com/wan-mlab/SAMP.
Affiliation(s)
- Junxi Feng
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, United States, 02115
- Mengtao Sun
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Cong Liu
  - Department of Mathematics, Data Science, University of Waterloo, Waterloo, ON, Canada, N2L3G1
- Weiwei Zhang
  - Department of Pathology, Microbiology, and Immunology, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Changmou Xu
  - Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL, United States, 61801
- Jieqiong Wang
  - Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Guangshun Wang
  - Department of Pathology, Microbiology, and Immunology, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States, 68198
4
Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] Open
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
  - College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
5
Sidorczuk K, Gagat P, Kała J, Nielsen H, Pietluch F, Mackiewicz P, Burdukiewicz M. Prediction of protein subplastid localization and origin with PlastoGram. Sci Rep 2023; 13:8365. [PMID: 37225726 DOI: 10.1038/s41598-023-35296-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Accepted: 05/16/2023] [Indexed: 05/26/2023] Open
Abstract
Due to their complex history, plastids possess proteins encoded in the nuclear and plastid genome. Moreover, these proteins localize to various subplastid compartments. Since protein localization is associated with its function, prediction of subplastid localization is one of the most important steps in plastid protein annotation, providing insight into their potential function. Therefore, we create a novel manually curated data set of plastid proteins and build an ensemble model for prediction of protein subplastid localization. Moreover, we discuss problems associated with the task, e.g. data set sizes and homology reduction. PlastoGram classifies proteins as nuclear- or plastid-encoded and predicts their localization considering: envelope, stroma, thylakoid membrane or thylakoid lumen; for the latter, the import pathway is also predicted. We also provide an additional function to differentiate nuclear-encoded inner and outer membrane proteins. PlastoGram is available as a web server at https://biogenies.info/PlastoGram and as an R package at https://github.com/BioGenies/PlastoGram . The code used for described analyses is available at https://github.com/BioGenies/PlastoGram-analysis .
Affiliation(s)
- Przemysław Gagat
  - Faculty of Biotechnology, University of Wrocław, 50-383, Wrocław, Poland
- Jakub Kała
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662, Warsaw, Poland
- Henrik Nielsen
  - Department of Health Technology, Technical University of Denmark, 2800, Kgs. Lyngby, Denmark
- Filip Pietluch
  - Faculty of Biotechnology, University of Wrocław, 50-383, Wrocław, Poland
- Paweł Mackiewicz
  - Faculty of Biotechnology, University of Wrocław, 50-383, Wrocław, Poland
- Michał Burdukiewicz
  - Institute of Biotechnology and Biomedicine, Autonomous University of Barcelona, 08193, Cerdanyola del Vallés, Spain
  - Clinical Research Centre, Medical University of Białystok, 15-089, Białystok, Poland
6
Zhou H, Tan W, Shi S. DeepGpgs: a novel deep learning framework for predicting arginine methylation sites combined with Gaussian prior and gated self-attention mechanism. Brief Bioinform 2023; 24:7000314. [PMID: 36694944 DOI: 10.1093/bib/bbad018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/09/2022] [Revised: 12/26/2022] [Accepted: 01/04/2023] [Indexed: 01/26/2023] Open
Abstract
Protein arginine methylation is an important posttranslational modification (PTM) associated with protein functional diversity and pathological conditions including cancer. Identification of methylation binding sites facilitates a better understanding of the molecular function of proteins. Recent developments in the field of deep neural networks have led to a proliferation of deep learning-based methylation identification studies because of their fast and accurate prediction. In this paper, we propose DeepGpgs, an advanced deep learning model incorporating Gaussian prior and gated attention mechanism. We introduce a residual network channel to extract the evolutionary information of proteins. Then we combine the adaptive embedding with bidirectional long short-term memory networks to form a context-shared encoder layer. A gated multi-head attention mechanism is followed to obtain the global information about the sequence. A Gaussian prior is injected into the sequence to assist in predicting PTMs. We also propose a weighted joint loss function to alleviate the false negative problem. We empirically show that DeepGpgs improves Matthews correlation coefficient by 6.3% on the arginine methylation independent test set compared with the existing state-of-the-art methylation site prediction methods. Furthermore, DeepGpgs has good robustness in phosphorylation site prediction of SARS-CoV-2, which indicates that DeepGpgs has good transferability and the potential to be extended to other modification sites prediction. The open-source code and data of the DeepGpgs can be obtained from https://github.com/saizhou1/DeepGpgs.
Affiliation(s)
- Haiwei Zhou
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Wenxi Tan
  - School of Mathematical Sciences, Fudan University, Shanghai 200433, China
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
7
Ju Z, Wang SY. Prediction of lysine HMGylation sites using multiple feature extraction and fuzzy support vector machine. Anal Biochem 2023; 663:115032. [PMID: 36592921 DOI: 10.1016/j.ab.2022.115032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 12/25/2022] [Indexed: 12/31/2022]
Abstract
Protein 3-hydroxyl-3-methylglutarylation (HMGylation) is a newly discovered lysine acylation modification in mitochondria. Accurate identification of HMGylation sites is a prerequisite for further exploring the molecular mechanisms of HMGylation. In this study, a novel bioinformatics tool named HMGPred is developed to predict HMGylation sites. Multiple effective features, including amino acid composition, amino acid factors, binary encoding, and the composition of k-spaced amino acid pairs, are integrated to encode HMGylation sites. F-score feature ranking with incremental feature selection is used to eliminate redundant features. Moreover, a fuzzy support vector machine algorithm is used to reduce the influence of noise by assigning different fuzzy membership degrees to different samples. As illustrated by 10-fold cross-validation, HMGPred achieves a satisfactory performance with an area under the receiver operating characteristic curve of 0.9110. Feature analysis indicates that some k-spaced amino acid pair features, such as 'KxxxT' and 'DxxxE', play a critical role in the prediction of HMGylation sites. The results of prediction and analysis might be helpful for investigating the mechanisms of HMGylation. For the convenience of experimental researchers, HMGPred is implemented as a web server at http://123.206.31.171/HMGPred/.
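The composition of k-spaced amino acid pairs (CKSAAP) named above can be sketched as follows. For each gap size k, the encoding counts residue pairs separated by exactly k positions and normalizes by the number of such pairs in the window, so a feature such as 'KxxxT' corresponds to k = 3. This is an illustrative sketch of the commonly used definition, not HMGPred's code; the function name `cksaap`, the default `k_max=4`, and the example window are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 residue pairs


def cksaap(window, k_max=4):
    """Return {pattern: frequency} for all residue pairs with gaps 0..k_max.

    Patterns are written like 'KxxxT' (K and T separated by three residues).
    Pairs that contain non-standard characters (e.g. padding) are ignored.
    """
    features = {}
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = len(window) - k - 1  # number of k-spaced pairs in the window
        for i in range(max(total, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        for pair, c in counts.items():
            key = pair[0] + "x" * k + pair[1]
            features[key] = c / total if total > 0 else 0.0
    return features


if __name__ == "__main__":
    feats = cksaap("KAAATKAAAT")  # hypothetical sequence window
    print(feats["KxxxT"])
```

With `k_max=4` the encoding yields 5 x 400 = 2000 features per window, which is why a feature selection step usually follows it.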
Affiliation(s)
- Zhe Ju
  - College of Science, Shenyang Aerospace University, 110136, People's Republic of China
- Shi-Yun Wang
  - College of Science, Shenyang Aerospace University, 110136, People's Republic of China
8
Bankapur S, Patil N. An Effective Multi-Label Protein Sub-Chloroplast Localization Prediction by Skipped-Grams of Evolutionary Profiles Using Deep Neural Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1449-1458. [PMID: 33175683 DOI: 10.1109/tcbb.2020.3037465] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
The chloroplast is one of the classic organelles of algal and plant cells. Identifying the locations of proteins within the chloroplast is an important and challenging task in deciphering their functions. Experimental identification of Protein Sub-Chloroplast Localization (PSCL) is time-consuming and costly. Over the last decade, a few computational methods have been developed to predict PSCL; earlier works predicted only a single location, whereas recent works can predict multiple locations within the chloroplast. However, the performance of all the state-of-the-art predictors is poor. This article proposes a novel skip-gram technique to extract highly discriminating patterns from evolutionary profiles and a multi-label deep neural network to predict the PSCL. The proposed model is assessed on two publicly available datasets, i.e., Benchmark and Novel. Experimental results demonstrate that the proposed work significantly outperforms the state-of-the-art multi-label PSCL predictors. The multi-label prediction accuracy (i.e., Overall Actual Accuracy) of the proposed model improves by an absolute margin of at least 6.7 percent on the Benchmark dataset and 7.9 percent on the Novel dataset when compared to the best PSCL predictor from the literature. Further, a statistical t-test indicates that the performance improvement is significant; thus, the proposed work is an effective computational model for multi-label PSCL prediction. The proposed prediction model is hosted on a web server and is available at https://nitkit-vgst727-nppsa.nitk.ac.in/deeplocpred/.
9
Ras-Carmona A, Gomez-Perosanz M, Reche PA. Prediction of unconventional protein secretion by exosomes. BMC Bioinformatics 2021; 22:333. [PMID: 34134630 PMCID: PMC8210391 DOI: 10.1186/s12859-021-04219-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 03/15/2021] [Accepted: 05/21/2021] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION In eukaryotes, proteins targeted for secretion contain a signal peptide, which allows them to proceed through the conventional ER/Golgi-dependent pathway. However, an important number of proteins lacking a signal peptide can be secreted through unconventional routes, including that mediated by exosomes. Currently, no method is available to predict protein secretion via exosomes. RESULTS Here, we first assembled a dataset including the sequences of 2992 proteins secreted by exosomes and 2961 proteins that are not secreted by exosomes. Subsequently, we trained different random forests models on feature vectors derived from the sequences in this dataset. In tenfold cross-validation, the best model was trained on dipeptide composition, reaching an accuracy of 69.88% ± 2.08 and an area under the curve (AUC) of 0.76 ± 0.03. In an independent dataset, this model reached an accuracy of 75.73% and an AUC of 0.840. After these results, we developed ExoPred, a web-based tool that uses random forests to predict protein secretion by exosomes. CONCLUSION ExoPred is available for free public use at http://imath.med.ucm.es/exopred/ . Datasets are available at http://imath.med.ucm.es/exopred/datasets/ .
Affiliation(s)
- Alvaro Ras-Carmona
  - Laboratory of Immunomedicine, Department of Immunology, Faculty of Medicine, Complutense University of Madrid, Pza Ramón y Cajal, s/n, 28040 Madrid, Spain
- Marta Gomez-Perosanz
  - Laboratory of Immunomedicine, Department of Immunology, Faculty of Medicine, Complutense University of Madrid, Pza Ramón y Cajal, s/n, 28040 Madrid, Spain
- Pedro A. Reche
  - Laboratory of Immunomedicine, Department of Immunology, Faculty of Medicine, Complutense University of Madrid, Pza Ramón y Cajal, s/n, 28040 Madrid, Spain
10
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of protein function. However, current methods for assigning the subcellular localization of eukaryotic mRNA have important limitations: (1) turning a multi-class problem into multiple binary problems makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. RESULTS In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based : physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets indicates that it is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and makes it convenient for users to estimate the subcellular localization of eukaryotic mRNA.
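The 3:2 weighting of the two single-layer models can be illustrated with a short sketch. It assumes that each single-layer model outputs one probability per location class and that the final score is the weighted average of the two; the abstract states only the ratio, so the combination rule, the function names, and the class labels below are hypothetical.

```python
def weighted_ensemble(seq_probs, phys_probs, weights=(3, 2)):
    """Weighted average of two class-probability vectors (default ratio 3:2)."""
    w_seq, w_phys = weights
    total = w_seq + w_phys
    return [(w_seq * s + w_phys * p) / total for s, p in zip(seq_probs, phys_probs)]


def predict(seq_probs, phys_probs, labels):
    """Return the label with the highest combined score, plus all combined scores."""
    combined = weighted_ensemble(seq_probs, phys_probs)
    best = max(range(len(labels)), key=lambda i: combined[i])
    return labels[best], combined


if __name__ == "__main__":
    labels = ["cytoplasm", "nucleus", "endoplasmic reticulum"]  # hypothetical classes
    seq_probs = [0.5, 0.3, 0.2]   # output of the sequence-based model (made up)
    phys_probs = [0.2, 0.6, 0.2]  # output of the physicochemical model (made up)
    print(predict(seq_probs, phys_probs, labels))
```

In this made-up example the sequence-based model prefers the first class, but the physicochemical model is confident enough in the second class to change the combined decision despite its lower weight.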
Affiliation(s)
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
11
Ju Z, Wang SY. Prediction of Neddylation Sites Using the Composition of k-spaced Amino Acid Pairs and Fuzzy SVM. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191114123453] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/22/2022]
Abstract
Introduction: Neddylation is the process by which the ubiquitin-like protein NEDD8 attaches to substrate lysine residues via isopeptide bonds. As a highly dynamic and reversible post-translational modification, lysine neddylation has been found to be involved in various biological processes and closely associated with many diseases.
Objective: The accurate identification of neddylation sites is necessary to elucidate the underlying molecular mechanisms of neddylation. As traditional experimental methods are often expensive and time-consuming, it is imperative to design computational methods to identify neddylation sites.
Methods: In this study, a novel predictor named CKSAAP_NeddSite is developed to detect neddylation sites. An effective feature encoding technology, the composition of k-spaced amino acid pairs, is used to encode neddylation sites. The F-score feature selection method is adopted to remove redundant features. Moreover, a fuzzy support vector machine algorithm is employed to overcome the class imbalance and noise problems.
Results: As illustrated by 10-fold cross-validation, CKSAAP_NeddSite achieves an AUC of 0.9848. Independent tests also show that CKSAAP_NeddSite significantly outperforms existing neddylation site predictors. Therefore, CKSAAP_NeddSite can be a useful bioinformatics tool for the prediction of neddylation sites. Feature analysis shows that some residues around neddylation sites may play an important role in the prediction.
Conclusion: The results of analysis and prediction could offer useful information for elucidating the molecular mechanisms of neddylation. A user-friendly web server for CKSAAP_NeddSite is established at 123.206.31.171/CKSAAP_NeddSite.
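The abstract does not define the F-score used for feature selection. The sketch below assumes the widely used definition in which a feature scores highly when its class means lie far from the overall mean relative to the within-class variances; whether CKSAAP_NeddSite uses exactly this form is an assumption, and the function names and toy data are hypothetical.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of one feature given its values in positive and negative samples.

    Each class needs at least two values so that the sample variance is defined.
    """
    overall = mean(pos + neg)
    num = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    den = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    if den == 0:
        return float("inf") if num > 0 else 0.0
    return num / den


def rank_features(x_pos, x_neg):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(x_pos[0])
    scores = [
        f_score([row[j] for row in x_pos], [row[j] for row in x_neg])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    x_pos = [[1.0, 0.5], [1.2, 0.4], [0.9, 0.6]]
    x_neg = [[0.1, 0.5], [0.2, 0.6], [0.0, 0.4]]
    print(rank_features(x_pos, x_neg))
```

A selection step would then keep the top-ranked features, for example by adding them one at a time and monitoring cross-validated performance.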
Affiliation(s)
- Zhe Ju
  - College of Science, Shenyang Aerospace University, Shenyang 110136, China
- Shi-Yun Wang
  - College of Science, Shenyang Aerospace University, Shenyang 110136, China
12
Sharma N, Patiyal S, Dhall A, Pande A, Arora C, Raghava GPS. AlgPred 2.0: an improved method for predicting allergenic proteins and mapping of IgE epitopes. Brief Bioinform 2020; 22:5985292. [PMID: 33201237 DOI: 10.1093/bib/bbaa294] [Citation(s) in RCA: 123] [Impact Index Per Article: 24.6] [Received: 06/21/2020] [Revised: 10/02/2020] [Accepted: 10/05/2020] [Indexed: 12/22/2022] Open
Abstract
AlgPred 2.0 is a web server developed for predicting allergenic proteins and allergenic regions in a protein. It is an updated version of AlgPred developed in 2006. The dataset used for training, testing and validation consists of 10 075 allergens and 10 075 non-allergens. In addition, 10 451 experimentally validated immunoglobulin E (IgE) epitopes were used to identify antigenic regions in a protein. All models were trained on 80% of data called training dataset, and the performance of models was evaluated using 5-fold cross-validation technique. The performance of the final model trained on the training dataset was evaluated on 20% of data called validation dataset; no two proteins in any two sets have more than 40% similarity. First, a Basic Local Alignment Search Tool (BLAST) search has been performed against the dataset, and allergens were predicted based on the level of similarity with known allergens. Second, IgE epitopes obtained from the IEDB database were searched in the dataset to predict allergens based on their presence in a protein. Third, motif-based approaches like multiple EM for motif elicitation/motif alignment and search tool have been used to predict allergens. Fourth, allergen prediction models have been developed using a wide range of machine learning techniques. Finally, the ensemble approach has been used for predicting allergenic protein by combining prediction scores of different approaches. Our best model achieved maximum performance in terms of area under receiver operating characteristic curve 0.98 with Matthew's correlation coefficient 0.85 on the validation dataset. A web server AlgPred 2.0 has been developed that allows the prediction of allergens, mapping of IgE epitope, motif search and BLAST search (https://webs.iiitd.edu.in/raghava/algpred2/).
Affiliation(s)
- Neelam Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Akshara Pande
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Chakit Arora
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
13
Sharma R, Kumar S, Tsunoda T, Kumarevel T, Sharma A. Single-stranded and double-stranded DNA-binding protein prediction using HMM profiles. Anal Biochem 2020; 612:113954. [PMID: 32946833 DOI: 10.1016/j.ab.2020.113954] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/16/2020] [Revised: 08/26/2020] [Accepted: 09/10/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND: DNA-binding proteins perform important roles in cellular processes and are involved in many biological activities. These proteins include crucial protein-DNA binding domains and can interact with single-stranded or double-stranded DNA, and are accordingly classified as single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Computational prediction of SSBs and DSBs helps in annotating protein functions and understanding protein-binding domains. RESULTS: Performance is reported using the DNA-binding protein dataset that was recently introduced by Wang et al. [1]. The proposed method achieved a sensitivity of 0.600, specificity of 0.792, AUC of 0.758, MCC of 0.369, accuracy of 0.744, and F-measure of 0.536 on the independent test set. CONCLUSION: The proposed method, which uses hidden Markov model (HMM) profiles for feature extraction, outperformed the benchmark method in the literature and achieved an overall improvement of approximately 3%. The source code and supplementary information of the proposed method are available at https://github.com/roneshsharma/Predict-DNA-binding-proteins/wiki.
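The metrics quoted in this abstract (sensitivity, specificity, accuracy, F-measure, MCC) all derive from a binary confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' evaluation code; the function name `binary_metrics` and the counts passed to it are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so that no denominator below is zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=30, fp=20, tn=80, fn=20)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the classes are imbalanced.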
Affiliation(s)
- Ronesh Sharma
- School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji.
| | - Shiu Kumar
- School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji.
| | - Tatsuhiko Tsunoda
- Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan; Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan; Laboratory of Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, University of Tokyo, Tokyo, 113-0033, Japan.
| | - Thirumananseri Kumarevel
- Laboratory for Transcription Structural Biology, RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan.
| | - Alok Sharma
- Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan; Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan; School of Engineering and Physics, The University of the South Pacific, Suva, Fiji; Institute for Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane, QLD, Australia.
| |
|
14
|
Wang YG, Huang SY, Wang LN, Zhou ZY, Qiu JD. Accurate prediction of species-specific 2-hydroxyisobutyrylation sites based on machine learning frameworks. Anal Biochem 2020; 602:113793. [DOI: 10.1016/j.ab.2020.113793] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/18/2020] [Revised: 04/25/2020] [Accepted: 05/20/2020] [Indexed: 12/17/2022]
|
15
|
Dai C, He J, Hu K, Ding Y. Identifying essential proteins in dynamic protein networks based on an improved h-index algorithm. BMC Med Inform Decis Mak 2020; 20:110. [PMID: 32552708 PMCID: PMC7371468 DOI: 10.1186/s12911-020-01141-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2019] [Accepted: 06/01/2020] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND: The essential proteins in protein networks play an important role in complex cellular functions and in protein evolution. Therefore, the identification of essential proteins in a network can help to explain the structure, function, and dynamics of basic cellular networks. The existing dynamic protein networks regard the protein components as the same at all time points; however, the role of proteins can vary over time. METHODS: To improve the accuracy of identifying essential proteins, an improved h-index algorithm based on the attenuation coefficient method is proposed in this paper. This method incorporates previously neglected node information to improve the accuracy of the essential protein search. After an appropriate attenuation coefficient is chosen, measures such as monotonicity, SN, SP, PPV and NPV are tested for different essential protein search algorithms. RESULTS: The experimental results show that the proposed algorithm maintains the accuracy of the proteins found while identifying more essential proteins. CONCLUSIONS: The described experiments show that this method is more effective than other similar methods in identifying essential proteins in dynamic protein networks. This study can better explain the mechanism of life activities and provide a theoretical basis for the research and development of targeted drugs.
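For orientation, the plain (unmodified) h-index of a network node is the largest h such that the node has at least h neighbours whose degree is at least h. The sketch below shows only that baseline on a small hypothetical adjacency dictionary; the attenuation-coefficient refinement proposed in the paper is not specified in the abstract and is not reproduced here.

```python
def node_h_index(graph, node):
    """Return the largest h such that `node` has >= h neighbours of degree >= h.

    `graph` is an undirected network stored as {node: set_of_neighbours}.
    """
    neighbour_degrees = sorted((len(graph[n]) for n in graph[node]), reverse=True)
    h = 0
    for rank, degree in enumerate(neighbour_degrees, start=1):
        if degree >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical toy interaction network, used only for illustration.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}
h_values = {node: node_h_index(graph, node) for node in graph}
```

In this toy network the neighbours of "A" have degrees 3, 2 and 2, so two of them have degree at least 2 and the h-index of "A" is 2.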
Affiliation(s)
- Caiyan Dai
- College of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, 210000, China
| | - Ju He
- College of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, 210000, China
| | - Kongfa Hu
- College of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, 210000, China
| | - Youwei Ding
- College of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, 210000, China
| |
|
16
|
Ju Z, Wang SY. Prediction of 2-hydroxyisobutyrylation sites by integrating multiple sequence features with ensemble support vector machine. Comput Biol Chem 2020; 87:107280. [PMID: 32505881 DOI: 10.1016/j.compbiolchem.2020.107280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/30/2019] [Revised: 05/05/2020] [Accepted: 05/07/2020] [Indexed: 10/24/2022]
Abstract
Lysine 2-hydroxyisobutyrylation (Khib) is a new type of histone mark, which has been found to affect the association between histone and DNA. To better understand the molecular mechanism of Khib, it is important to identify 2-hydroxyisobutyrylated substrates and their corresponding Khib sites accurately. In this study, a novel bioinformatics tool named KhibPred is proposed to predict Khib sites in human HeLa cells. Three kinds of effective features, the composition of k-spaced amino acid pairs, binary encoding and amino acid factors, are incorporated to encode Khib sites. Moreover, an ensemble support vector machine is employed to overcome the class imbalance problem in the prediction. In 10-fold cross-validation, KhibPred achieves satisfactory performance, with an area under the receiver operating characteristic curve of 0.7937. Therefore, KhibPred can be a useful tool for predicting protein Khib sites. Feature analysis shows that the polarity factor features play significant roles in the prediction of Khib sites. The conclusions derived from this study might provide useful insights for in-depth investigation into the molecular mechanisms of Khib.
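Of the three encodings named in this abstract, the composition of k-spaced amino acid pairs is the easiest to illustrate. For each gap k, it counts residue pairs separated by k positions and normalises by the number of pairs in the window. The sketch below follows that common definition and is not KhibPred's actual implementation; the function name `cksaap`, the default `k_max=2`, and the example peptide are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(peptide, k_max=2):
    """Return the k-spaced amino acid pair composition for k = 0..k_max.

    The result has 400 * (k_max + 1) entries: one block of 400 pair
    frequencies per gap size, in the order given by AMINO_ACIDS.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(peptide) - k - 1):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:  # skip gap symbols or non-standard residues
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features

# Hypothetical five-residue window, used only to show the shape of the output.
vec = cksaap("AKAKA", k_max=1)
```

For "AKAKA" with k = 0 the adjacent pairs are AK, KA, AK, KA, so the AK frequency is 0.5; with k = 1 the pairs are AA, KK, AA, so the AA frequency is 2/3.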
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China.
| | - Shi-Yun Wang
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China.
| |
|
17
|
Du L, Meng Q, Chen Y, Wu P. Subcellular location prediction of apoptosis proteins using two novel feature extraction methods based on evolutionary information and LDA. BMC Bioinformatics 2020; 21:212. [PMID: 32448129 PMCID: PMC7245797 DOI: 10.1186/s12859-020-3539-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/29/2019] [Accepted: 05/06/2020] [Indexed: 11/13/2022] Open
Abstract
Background: Apoptosis, also called programmed cell death, refers to the spontaneous and orderly death of cells controlled by genes in order to maintain a stable internal environment. Identifying the subcellular location of apoptosis proteins is very helpful in understanding the mechanism of apoptosis and designing drugs. Therefore, the subcellular localization of apoptosis proteins has attracted increased attention in computational biology. Effective feature extraction methods play a critical role in predicting the subcellular location of proteins. Results: In this paper, we proposed two novel feature extraction methods based on evolutionary information. One obtained the evolutionary information via the transition matrix of the consensus sequence (CTM), and the other utilized the evolutionary information from the position-specific scoring matrix (PSSM) based on absolute entropy correlation analysis (AECA-PSSM). After fusing the two kinds of features, linear discriminant analysis (LDA) was used to reduce the dimension of the proposed features. Finally, a support vector machine (SVM) was adopted to predict the protein subcellular locations. The proposed CTM-AECA-PSSM-LDA subcellular location prediction method was evaluated using the CL317 and ZW225 datasets. By jackknife test, the overall accuracy was 99.7% (CL317) and 95.6% (ZW225), respectively. Conclusions: The experimental results show that the proposed method, which we hope will complement the existing methods of subcellular localization, can effectively extract more abundant features from protein sequences and is feasible for predicting the subcellular location of apoptosis proteins.
Affiliation(s)
- Lei Du
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China; Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Qingfang Meng
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China; Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China; Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Peng Wu
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China; Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| |
|
18
|
Dervisi I, Valassakis C, Agalou A, Papandreou N, Podia V, Haralampidis K, Iconomidou VA, Kouvelis VN, Spaink HP, Roussis A. Investigation of the interaction of DAD1-LIKE LIPASE 3 (DALL3) with Selenium Binding Protein 1 (SBP1) in Arabidopsis thaliana. Plant Sci 2020; 291:110357. [PMID: 31928671 DOI: 10.1016/j.plantsci.2019.110357] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/26/2019] [Revised: 11/18/2019] [Accepted: 11/21/2019] [Indexed: 06/10/2023]
Abstract
Phospholipase PLA1-Iγ2, also known as DAD1-LIKE LIPASE 3 (DALL3), is a member of the class I phospholipases and has a role in jasmonic acid (JA) biosynthesis. AtDALL3 was previously identified in a yeast two-hybrid screen as an interacting protein of the Arabidopsis Selenium Binding Protein 1 (SBP1). In this work, we have studied AtDALL3 as an interacting partner of SBP1. Phylogenetic analysis showed that DALL3 appears in the PLA1-Igamma1, 2 group, paired with PLA1-Igamma1. The highest level of expression of AtDALL3 was observed in 10-day-old roots and in flowers, while constitutive levels were maintained in seedlings, cotyledons, shoots and leaves. In response to abiotic stress, DALL3 was shown to participate in the network of genes regulated by cadmium, selenite and selenate compounds. DALL3 promoter-driven GUS assays revealed that the expression patterns overlapped with those reported for the AtSBP1 gene, indicating that DALL3 and SBP1 transcripts co-localize. Furthermore, quantitative GUS assays showed that these compounds elicited changes in activity in specific cell files, indicating a differential response of the DALL3 promoter. GFP::DALL3 studies by confocal microscopy demonstrated the localization of DALL3 in the plastids of the root apex, the plastids of the central root and the apex of emerging lateral root primordia. Additionally, we confirmed the physical interaction of DALL3 with SBP1 by yeast two-hybrid assays and defined a minimal SBP1 fragment that DALL3 binds to. Finally, by employing bimolecular fluorescence complementation, we demonstrated the in planta interaction of the two proteins.
Affiliation(s)
- Irene Dervisi
- Department of Botany, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece
| | - Chrysanthi Valassakis
- Department of Botany, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece
| | - Adamantia Agalou
- Institute of Biology, Leiden University, Leiden, the Netherlands
| | - Nikolaos Papandreou
- Department of Cell Biology and Biophysics, Faculty of Biology, National & Kapodistrian University, 15784, Athens, Greece
| | - Varvara Podia
- Department of Botany, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece
| | - Kosmas Haralampidis
- Department of Botany, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece
| | - Vassiliki A Iconomidou
- Department of Cell Biology and Biophysics, Faculty of Biology, National & Kapodistrian University, 15784, Athens, Greece
| | - Vassili N Kouvelis
- Department of Genetics and Biotechnology, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece
| | - Herman P Spaink
- Institute of Biology, Leiden University, Leiden, the Netherlands
| | - Andreas Roussis
- Department of Botany, Faculty of Biology, National & Kapodistrian University of Athens, 15784, Athens, Greece.
| |
|
19
|
Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou's 5-steps rule and general pseudo components. Genomics 2020; 112:859-866. [DOI: 10.1016/j.ygeno.2019.05.027] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 04/21/2019] [Revised: 05/13/2019] [Accepted: 05/30/2019] [Indexed: 11/30/2022]
|
20
|
Discovering nuclear targeting signal sequence through protein language learning and multivariate analysis. Anal Biochem 2019; 591:113565. [PMID: 31883904 DOI: 10.1016/j.ab.2019.113565] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 10/11/2019] [Revised: 12/16/2019] [Accepted: 12/20/2019] [Indexed: 11/24/2022]
Abstract
Nuclear localization signals (NLSs) are peptides that target proteins to the nucleus by binding to carrier proteins in the cytoplasm that transport their cargo across the nuclear membrane. Accurate identification of NLSs can help elucidate the functions of nuclear protein complexes. The currently known NLS predictors are usually specific to certain species or largely dependent on prior knowledge of NLS basic residues. Thus, a more general predictor is highly desired to reduce the potentially high false positives or false negatives in discovering new NLSs. Here, we report a new method, INSP (Identification Nucleus Signal Peptide), to effectively identify NLSs, mainly based on statistical knowledge and machine learning algorithms. In our NLS machine learning model, we considered the query protein sequence as text and extracted the sequence context features using a natural language model. These word-vector features encode discriminative knowledge of NLS motif frequency and are thus useful for model recognition. The output of the machine learning model is fused with statistical knowledge of the query sequence to build a final multivariate regression model for NLS peptide identification. The experimental results demonstrate a promising performance of the new INSP approach. INSP is freely available at www.csbio.sjtu.edu.cn/bioinf/INSP/ for academic use.
|
21
|
Ju Z, Wang SY. Identify Lysine Neddylation Sites Using Bi-profile Bayes Feature Extraction via the Chou's 5-steps Rule and General Pseudo Components. Curr Genomics 2019; 20:592-601. [PMID: 32581647 PMCID: PMC7290059 DOI: 10.2174/1389202921666191223154629] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/01/2019] [Revised: 10/19/2019] [Accepted: 11/07/2019] [Indexed: 01/06/2023] Open
Abstract
Introduction: Neddylation is a highly dynamic and reversible post-translational modification. The abnormality of neddylation has previously been shown to be closely related to some human diseases. The detection of neddylation sites is essential for elucidating the regulation mechanisms of protein neddylation. Objective: As the detection of lysine neddylation sites by traditional experimental methods is often expensive and time-consuming, it is imperative to design computational methods to identify neddylation sites. Methods: In this study, a bioinformatics tool named NeddPred is developed to identify underlying protein neddylation sites. A bi-profile Bayes feature extraction is used to encode neddylation sites, and a fuzzy support vector machine model is utilized to overcome the problem of noise and class imbalance in the prediction. Results: NeddPred achieved a Matthews correlation coefficient of 0.7082 and an area under the receiver operating characteristic curve of 0.9769. Independent tests show that NeddPred significantly outperforms the existing lysine neddylation site predictor NeddyPreddy. Conclusion: Therefore, NeddPred can be a complement to the existing tools for the prediction of neddylation sites. A user-friendly webserver for NeddPred is accessible at 123.206.31.171/NeddPred/.
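As a rough illustration of bi-profile Bayes encoding, the sketch below represents a peptide of length n by 2n values: the frequency of each of its residues at the same position among positive training peptides, followed by the corresponding frequencies among negative training peptides. It follows the commonly described form of the encoding and is not NeddPred's code; the function names and the three-residue toy peptides are hypothetical.

```python
from collections import Counter

def position_profile(peptides):
    """Per-position residue frequencies for equal-length peptides."""
    length = len(peptides[0])
    profile = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        profile.append({aa: c / len(peptides) for aa, c in counts.items()})
    return profile

def bi_profile_encode(peptide, pos_profile, neg_profile):
    """Concatenate positive-set and negative-set frequencies of each residue."""
    pos = [pos_profile[i].get(aa, 0.0) for i, aa in enumerate(peptide)]
    neg = [neg_profile[i].get(aa, 0.0) for i, aa in enumerate(peptide)]
    return pos + neg

# Hypothetical training windows centred on lysine (K), for illustration only.
positives = ["AKG", "SKG", "AKL"]
negatives = ["GKA", "LKS", "GKA"]
vec = bi_profile_encode(
    "AKG", position_profile(positives), position_profile(negatives)
)
```

Here "A" occurs at the first position in two of the three positive windows and in none of the negative ones, so the first and fourth entries of `vec` are 2/3 and 0.0. The resulting vector would then be passed to a classifier such as the fuzzy support vector machine mentioned in the abstract.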
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
| | - Shi-Yun Wang
- College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
| |
|
22
|
Khan YD, Amin N, Hussain W, Rasool N, Khan SA, Chou KC. iProtease-PseAAC(2L): A two-layer predictor for identifying proteases and their types using Chou's 5-step-rule and general PseAAC. Anal Biochem 2019; 588:113477. [PMID: 31654612 DOI: 10.1016/j.ab.2019.113477] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 02/04/2019] [Revised: 10/02/2019] [Accepted: 10/18/2019] [Indexed: 12/16/2022]
Abstract
Proteases are enzymes that perform proteolysis. Proteolysis normally refers to protein and peptide degradation, which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. Proteases are classified into five different types according to their nature and physiochemical characteristics. Most methods used to differentiate proteases from other proteins and identify their class require a clinical test, which is usually time-consuming and operator-dependent. Herein, we report a classifier named iProtease-PseAAC (2L) for identifying proteases and their classes. The predictor is developed following the 5-step rule, starting from the collection of a benchmark dataset and ending with the development of the predictor. Rigorous verification and validation tests are performed, and metrics are collected to assess the reliability of the trained model. Self-consistency validation gives 98.32% accuracy, cross-validation gives 90.71% accuracy and the jackknife test gives 96.07% accuracy. The average accuracy for level 2, i.e. protease classification, is 95.77%. Based on these results, it is concluded that iProtease-PseAAC (2L) can reliably identify proteases and their classes from a given protein sequence.
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan.
| | - Najm Amin
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
| | - Sher Afzal Khan
- Faculty of Computing and Information Technology in Rabigh, Jeddah, 21577, Saudi Arabia; Abdul Wali Khan University, Department of Computer Sciences, Mardan, Pakistan
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA, 02478, USA
| |
|
23
|
Yang W, Zhu XJ, Huang J, Ding H, Lin H. A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181113131415] [Citation(s) in RCA: 111] [Impact Index Per Article: 18.5] [Indexed: 01/08/2023]
Abstract
Background: The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution. Objective: The identification of protein location in the Golgi apparatus will provide in-depth insights into protein functions. Thus, machine learning-based methods of predicting protein location in the Golgi apparatus have been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed to provide comprehensive background for the field. Method: The benchmark datasets, feature extraction, machine learning methods and published results were summarized. Results: We briefly introduced the recent progress in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed their advantages and disadvantages. Conclusion: We pointed out the perspective of machine learning methods in protein sub-Golgi localization prediction.
Affiliation(s)
- Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Jian Huang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| |
|
24
|
Prediction of Apoptosis Protein Subcellular Localization with Multilayer Sparse Coding and Oversampling Approach. Biomed Res Int 2019; 2019:2436924. [PMID: 30834257 PMCID: PMC6374881 DOI: 10.1155/2019/2436924] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/19/2018] [Revised: 01/04/2019] [Accepted: 01/20/2019] [Indexed: 11/29/2022]
Abstract
The prediction of apoptosis protein subcellular localization plays an important role in understanding the progress in cell proliferation and death. Recently, computational approaches to this issue have become very popular, since traditional biological experiments are so costly and time-consuming that they can no longer keep up with the growth rate of sequence data. To improve the prediction accuracy of apoptosis protein subcellular localization, we proposed a sparse coding method combined with a traditional feature extraction algorithm to complete the sparse representation of apoptosis protein sequences. The method uses multilayer pooling based on different sizes of dictionaries to integrate the processed features, and an oversampling approach to reduce the influence of unbalanced data sets. The extracted features were then input to a support vector machine to predict the subcellular localization of the apoptosis protein. The experimental results obtained by the jackknife test on two benchmark data sets indicate that our method can significantly improve the accuracy of apoptosis protein subcellular localization prediction.
|
25
|
Le NQK, Sandag GA, Ou YY. Incorporating post translational modification information for enhancing the predictive performance of membrane transport proteins. Comput Biol Chem 2018; 77:251-260. [DOI: 10.1016/j.compbiolchem.2018.10.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 09/19/2017] [Revised: 08/01/2018] [Accepted: 10/14/2018] [Indexed: 10/28/2022]
|
26
|
Yang Z, Wang J, Zheng Z, Bai X. A New Method for Recognizing Cytokines Based on Feature Combination and a Support Vector Machine Classifier. Molecules 2018; 23:E2008. [PMID: 30103521 PMCID: PMC6222536 DOI: 10.3390/molecules23082008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/19/2018] [Revised: 07/31/2018] [Accepted: 08/07/2018] [Indexed: 12/14/2022] Open
Abstract
Research on cytokine recognition is of great significance in the medical field because cytokines benefit the diagnosis and treatment of diseases, but the current methods for cytokine recognition have many shortcomings, such as low sensitivity and low F-score. Therefore, this paper proposes a new method based on feature combination. The features are extracted from amino acid compositions, physicochemical properties, secondary structures, and evolutionary information. The classifier used in this paper is a support vector machine (SVM). Experiments show that our method is better than other methods in terms of accuracy, sensitivity, specificity, F-score and Matthews correlation coefficient.
Affiliation(s)
- Zhe Yang
- School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China.
| | - Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China.
| | - Zhida Zheng
- School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China.
| | - Xin Bai
- School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China.
| |
|
27
|
Ju Z, He JJ. Prediction of lysine glutarylation sites by maximum relevance minimum redundancy feature selection. Anal Biochem 2018; 550:1-7. [DOI: 10.1016/j.ab.2018.04.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 01/17/2018] [Revised: 04/05/2018] [Accepted: 04/06/2018] [Indexed: 12/17/2022]
|
28
|
Zhang B, Li L, Lü Q. Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network. Biomolecules 2018; 8:biom8020033. [PMID: 29799510 PMCID: PMC6023031 DOI: 10.3390/biom8020033] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/21/2018] [Revised: 05/18/2018] [Accepted: 05/22/2018] [Indexed: 12/12/2022] Open
Abstract
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step in understanding its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed for combining bidirectional information from hidden nodes into outputs. Three types of merging operators were used in our improved model, with a long short-term memory network serving as the hidden computing node. The training dataset was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predicted values of continuous relative solvent-accessible area were obtained and then transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson’s correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
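The evaluation described in this abstract rests on two standard measures, mean absolute error and Pearson's correlation coefficient, plus a thresholding step that turns continuous relative accessibility into binary states. The sketch below illustrates those three operations only; the 0.25 threshold, the state labels and the sample values are assumptions, not figures from the paper.

```python
import math

def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def to_binary_states(values, threshold=0.25):
    """Label each residue as exposed or buried (threshold is an assumed value)."""
    return ["exposed" if v >= threshold else "buried" for v in values]

# Hypothetical relative solvent accessibility values for three residues.
predicted = [0.10, 0.40, 0.80]
observed = [0.20, 0.50, 0.60]

mae = mean_absolute_error(predicted, observed)
r = pearson(predicted, observed)
states = to_binary_states(predicted)
```

Reporting both measures is useful because a predictor can rank residues well (high correlation) while still being offset in absolute terms (high error), or the reverse.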
Affiliation(s)
- Buzhong Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
- School of Computer and Information, Anqing Normal University, Anqing 246011, China.
| | - Linqing Li
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| | - Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| |
|
29
|
Hasan MAM, Ahmad S, Molla MKI. Protein subcellular localization prediction using multiple kernel learning based support vector machine. Mol Biosyst 2017; 13:785-795. [DOI: 10.1039/c6mb00860g] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 01/05/2023]
Abstract
An efficient multi-label protein subcellular localization prediction system was developed by introducing multiple kernel learning (MKL) based support vector machine (SVM).
Affiliation(s)
- Md. Al Mehedi Hasan
- Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
| | - Shamim Ahmad
- Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
| | | |
|