1
Chen L, Liu L, Su H, Xu Y. KbhbXG: A Machine learning architecture based on XGBoost for prediction of lysine β-Hydroxybutyrylation (Kbhb) modification sites. Methods 2024; 227:27-34. [PMID: 38679187 DOI: 10.1016/j.ymeth.2024.04.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 04/16/2024] [Accepted: 04/20/2024] [Indexed: 05/01/2024] Open
Abstract
Lysine β-hydroxybutyrylation is an important post-translational modification (PTM) involved in various physiological and biological processes. In this research, we introduce a novel predictor, KbhbXG, which utilizes XGBoost to identify β-hydroxybutyrylation modification sites based on protein sequence information. Traditional experimental methods for identifying β-hydroxybutyrylated sites using proteomic techniques are both costly and time-consuming. Thus, computational methods and predictors can play a crucial role in facilitating the rapid identification of β-hydroxybutyrylation sites. Our proposed KbhbXG model first utilizes the machine learning algorithm XGBoost to predict β-hydroxybutyrylation modification sites. On the independent test set, KbhbXG achieves an accuracy of 0.7457, a specificity of 0.7771, and an area under the curve (AUC) score of 0.8172. The high AUC score achieved by our method demonstrates its potential for effectively identifying novel β-hydroxybutyrylation sites, thereby facilitating further research and exploration of the β-hydroxybutyrylation process. In addition, functional analyses revealed that different organisms preferentially engage in distinct biological processes and pathways, which can provide valuable insights for understanding the mechanism of β-hydroxybutyrylation and guide experimental verification. To promote transparency and reproducibility, we have made both the code and the dataset of KbhbXG publicly available. Researchers interested in utilizing our proposed model can access these resources at https://github.com/Lab-Xu/KbhbXG.
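The three metrics reported in this abstract (accuracy, specificity, and AUC) can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not KbhbXG outputs, and the 0.5 decision threshold and all function names are assumptions for illustration only; the authors' own evaluation code lives in their repository and may differ.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_score: Sequence[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) after thresholding the predicted scores."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_score, threshold=0.5) -> float:
    """Fraction of all sites classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return (tp + tn) / (tp + fp + tn + fn)


def specificity(y_true, y_score, threshold=0.5) -> float:
    """Fraction of true negatives among all negative (unmodified) sites."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return tn / (tn + fp)


def auc(y_true, y_score) -> float:
    """AUC as the probability that a random positive outscores a random
    negative; ties count as half a win."""
    positives = [s for l, s in zip(y_true, y_score) if l == 1]
    negatives = [s for l, s in zip(y_true, y_score) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = modified site, 0 = unmodified site.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

print(accuracy(labels, scores), specificity(labels, scores), auc(labels, scores))
```

Accuracy and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why predictors of this kind usually report both kinds of number.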
Affiliation(s)
- Leqi Chen
  - Department of Statistics, University of Science and Technology Beijing, Beijing 100083, China
- Liwen Liu
  - The Open University of China, Beijing 100039, China
- Haiyan Su
  - School of Computing, Montclair State University, NJ 07043, USA
- Yan Xu
  - Department of Statistics, University of Science and Technology Beijing, Beijing 100083, China
2
Wang GA, Yan X, Li X, Liu Y, Xia J, Zhu X. MSTL-Kace: Prediction of Prokaryotic Lysine Acetylation Sites Based on Multistage Transfer Learning Strategy. ACS Omega 2023; 8:41930-41942. [PMID: 37969991 PMCID: PMC10634282 DOI: 10.1021/acsomega.3c07086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Revised: 10/11/2023] [Accepted: 10/13/2023] [Indexed: 11/17/2023]
Abstract
As one of the most important post-translational modifications (PTMs), lysine acetylation (Kace) plays an important role in various biological activities. Traditional experimental methods for identifying Kace sites are inefficient and expensive. Instead, several machine learning methods have been developed for Kace site prediction, and hand-crafted features have been used to encode the protein sequences. However, two challenges remain: the complex biological information may be under-represented by these manmade features, and the small-sample issue for some species needs to be addressed. We propose a novel model, MSTL-Kace, which was developed based on a transfer learning strategy with a pretrained bidirectional encoder representations from transformers (BERT) model. In this model, high-level embeddings were extracted from species-specific BERT models, and a two-stage fine-tuning strategy was used to deal with the small-sample issue. Specifically, a domain-specific BERT model was pretrained using all of the sequences in our data sets and was then fine-tuned, or two-stage fine-tuned, on the training data set of each species to obtain the species-specific BERT models. Afterward, the embeddings of residues were extracted from the fine-tuned model and fed to different downstream learning algorithms. After comparison, the best model for the six prokaryotic species was built by using a random forest. The results for the independent test sets show that our model outperforms the state-of-the-art methods on all six species. The source code and data for MSTL-Kace are available at https://github.com/leo97king/MSTL-Kace.
Affiliation(s)
- Gang-Ao Wang
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Xiaodi Yan
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Xiang Li
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Yinbo Liu
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Junfeng Xia
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, Anhui, China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
3
Kumari S, Gupta R, Ambasta RK, Kumar P. Emerging trends in post-translational modification: Shedding light on Glioblastoma multiforme. Biochim Biophys Acta Rev Cancer 2023; 1878:188999. [PMID: 37858622 DOI: 10.1016/j.bbcan.2023.188999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 10/06/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
Recent multi-omics studies, including proteomics, transcriptomics, genomics, and metabolomics, have revealed the critical role of post-translational modifications (PTMs) in the progression and pathogenesis of Glioblastoma multiforme (GBM). Further, PTMs alter oncogenic signaling events and offer a novel avenue in GBM therapeutics research through PTM enzymes as potential biomarkers for drug targeting. In addition, PTMs are critical regulators of chromatin architecture, gene expression, and the tumor microenvironment (TME), which play a crucial role in tumorigenesis. Moreover, the implementation of artificial intelligence and machine learning algorithms enhances GBM therapeutics research through the identification of novel PTM enzymes and residues. Herein, we briefly explain the mechanism of protein modifications in GBM etiology and in altering the biologics of GBM cells through chromatin remodeling, modulation of the TME, and signaling pathways. In addition, we highlight the importance of PTM enzymes as therapeutic biomarkers and the role of artificial intelligence and machine learning in protein PTM prediction.
Affiliation(s)
- Smita Kumari
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India
- Rohan Gupta
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; School of Medicine, University of South Carolina, Columbia, SC, United States of America
- Rashmi K Ambasta
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; Department of Biotechnology and Microbiology, SRM University, Sonepat, Haryana, India
- Pravir Kumar
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India
4
Kim JK, Lee S, Hong SK, Kwak C, Jeong CW, Kang SH, Hong SH, Kim YJ, Chung J, Hwang EC, Kwon TG, Byun SS, Jung YJ, Lim J, Kim J, Oh H. Machine learning based prediction for oncologic outcomes of renal cell carcinoma after surgery using Korean Renal Cell Carcinoma (KORCC) database. Sci Rep 2023; 13:5778. [PMID: 37031280 PMCID: PMC10082844 DOI: 10.1038/s41598-023-30826-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 03/02/2023] [Indexed: 04/10/2023] Open
Abstract
We developed a novel prediction model for recurrence and survival in patients with localized renal cell carcinoma (RCC) after surgery, together with a novel statistical method of machine learning (ML) to improve accuracy in predicting outcomes, using a large Asian nationwide dataset: the updated KOrean Renal Cell Carcinoma (KORCC) database, which covers a total of 10,068 patients who had received surgery for RCC. After data pre-processing, feature selection was performed with an elastic net. Nine variables for recurrence and 13 variables for survival were extracted from 206 variables. The synthetic minority oversampling technique (SMOTE) was applied to the training data set to address class imbalance. We applied most of the existing ML algorithms to evaluate performance. We also performed subgroup analysis according to histologic type. All prediction models achieved high accuracy (range, 0.77-0.94) and F1-score (range, 0.77-0.97). In an external validation set, high accuracy and F1-score were well maintained for both recurrence and survival. In subgroup analysis, prediction performance was also good for both clear cell and non-clear cell RCC.
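The core idea of SMOTE mentioned in this abstract can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified standard-library illustration under assumed parameters (`k`, `seed`, and the function name are hypothetical), not the KORCC authors' pipeline, which presumably relied on an established implementation.

```python
import math
import random
from typing import List, Sequence


def smote_oversample(minority: List[Sequence[float]], n_new: int,
                     k: int = 3, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_oversample(minority, n_new=6)
print(len(new_points))
```

As the abstract states, oversampling is applied to the training set only; synthetic points in a validation or test set would make the reported performance hard to interpret.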
Affiliation(s)
- Jung Kwon Kim
  - Department of Urology, Seoul National University Bundang Hospital, Seongnam, Korea
  - Department of Urology, Seoul National University College of Medicine, Seoul, Korea
- Sangchul Lee
  - Department of Urology, Seoul National University Bundang Hospital, Seongnam, Korea
  - Department of Urology, Seoul National University College of Medicine, Seoul, Korea
- Sung Kyu Hong
  - Department of Urology, Seoul National University Bundang Hospital, Seongnam, Korea
  - Department of Urology, Seoul National University College of Medicine, Seoul, Korea
- Cheol Kwak
  - Department of Urology, Seoul National University College of Medicine, Seoul, Korea
  - Department of Urology, Seoul National University Hospital, Seoul, Korea
- Chang Wook Jeong
  - Department of Urology, Seoul National University College of Medicine, Seoul, Korea
  - Department of Urology, Seoul National University Hospital, Seoul, Korea
- Seok Ho Kang
  - Department of Urology, Korea University Anam Hospital, Seoul, Korea
- Sung-Hoo Hong
  - Department of Urology, Seoul St. Mary's Hospital, The Catholic University of Korea, Seoul, Korea
- Yong-June Kim
  - Department of Urology, Chungbuk National University Hospital, Cheongju, Korea
- Jinsoo Chung
  - Department of Urology, National Cancer Center, Goyang, Korea
- Eu Chang Hwang
  - Department of Urology, Chonnam National University Medical School, Gwangju, Korea
- Tae Gyun Kwon
  - Department of Urology, Kyungpook National University Chilgok Hospital, Daegu, Korea
- Seok-Soo Byun
  - Department of Urology, Seoul National University Bundang Hospital, Seongnam, Korea
  - Department of Medical Device Development, Seoul National University College of Medicine, Seoul, Korea
- Yu Jin Jung
  - Department of Medical Device Development, Seoul National University College of Medicine, Seoul, Korea
5
Weigle AT, Feng J, Shukla D. Thirty years of molecular dynamics simulations on posttranslational modifications of proteins. Phys Chem Chem Phys 2022; 24:26371-26397. [PMID: 36285789 PMCID: PMC9704509 DOI: 10.1039/d2cp02883b] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 09/06/2023]
Abstract
Posttranslational modifications (PTMs) are an integral component to how cells respond to perturbation. While experimental advances have enabled improved PTM identification capabilities, the same throughput for characterizing how structural changes caused by PTMs equate to altered physiological function has not been maintained. In this Perspective, we cover the history of computational modeling and molecular dynamics simulations which have characterized the structural implications of PTMs. We distinguish results from different molecular dynamics studies based upon the timescales simulated and analysis approaches used for PTM characterization. Lastly, we offer insights into how opportunities for modern research efforts on in silico PTM characterization may proceed given current state-of-the-art computing capabilities and methodological advancements.
Affiliation(s)
- Austin T Weigle
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Jiangyan Feng
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
6
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022] Open
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases and play a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, deep learning approaches for PTM prediction provide accurate and rapid screening, guiding downstream wet experiments to leverage the screened information for focused studies. In this paper, we review recent work on deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarize PTM databases and discuss future directions with critical insights.
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M.musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S.cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning
7
DeepDA-Ace: A Novel Domain Adaptation Method for Species-Specific Acetylation Site Prediction. Mathematics 2022. [DOI: 10.3390/math10142364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2023]
Abstract
Protein lysine acetylation is an important type of post-translational modification (PTM), and it plays a crucial role in various cellular processes. Although many researchers have recently focused on developing computational tools for acetylation site prediction, most of these tools use traditional machine learning algorithms without species specificity and rely on a single prediction model. Recent studies have shown that the acetylation sites of distinct species have evident location-specific differences; however, there is currently no integrated prediction model that can effectively predict acetylation sites across all species. Therefore, it is necessary to establish a framework for species-specific acetylation site prediction. In this work, we propose a domain adaptation framework, DeepDA-Ace, for species-specific acetylation site prediction, covering Rattus norvegicus, Schistosoma japonicum, Arabidopsis thaliana, and other species. In DeepDA-Ace, an attention-based densely connected convolutional neural network is designed to capture sequence features, and a semantic adversarial learning strategy is proposed to align the features of different species so as to achieve knowledge transfer. DeepDA-Ace outperformed both the general prediction model and the fine-tuning-based species-specific model across most species. The experimental results demonstrate that DeepDA-Ace is superior to the general and fine-tuning methods, and its precision exceeds 0.75 on most species. In addition, our method achieves at least a 5% improvement over the existing acetylation prediction tools.
8
Deep Learning-Based Advances In Protein Posttranslational Modification Site and Protein Cleavage Prediction. Methods in Molecular Biology (Clifton, N.J.) 2022; 2499:285-322. [PMID: 35696087 DOI: 10.1007/978-1-0716-2317-6_15] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/18/2022]
Abstract
Posttranslational modification (PTM) is a ubiquitous phenomenon in both eukaryotes and prokaryotes which gives rise to enormous proteomic diversity. PTM mostly comes in two flavors: covalent modification of the polypeptide chain and proteolytic cleavage. Understanding and characterizing PTMs is a fundamental step toward understanding the underpinnings of biology. Recent advances in experimental approaches, mainly mass-spectrometry-based approaches, have immensely helped in obtaining and characterizing PTMs. However, experimental approaches are not enough to understand and characterize more than 450 different types of PTMs, and complementary computational approaches are becoming popular. Recently, due to various advancements in the field of deep learning (DL), along with the explosion of applications of DL to various fields, the computational prediction of PTMs has also witnessed the development of a plethora of DL-based approaches. In this book chapter, we first review some recent DL-based approaches in the field of PTM site prediction. In addition, we review recent advances in a less-studied PTM, that is, proteolytic cleavage prediction. We describe advances in PTM prediction by highlighting the deep learning architecture, feature encoding, novelty of the approaches, and availability of the tools/approaches. Finally, we provide an outlook and possible future research directions for DL-based approaches for PTM prediction.
9
Kang W, Liu L, Yu P, Zhang T, Lei C, Nie Z. A switchable Cas12a enabling CRISPR-based direct histone deacetylase activity detection. Biosens Bioelectron 2022; 213:114468. [PMID: 35700604 DOI: 10.1016/j.bios.2022.114468] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 03/22/2022] [Revised: 05/30/2022] [Accepted: 06/06/2022] [Indexed: 11/02/2022]
Abstract
The efficient and robust signal reporting ability of the CRISPR-Cas system is of great value in biosensing, but its applicability to non-nucleic acid analyte detection relies on the coupling of additional recognition modules. To address this limitation, we describe a switchable Cas12a and exploit it for CRISPR-based direct analysis of histone deacetylase (HDAC) activity. Starting from the acetylation-mediated inactivation of Cas12a by the anti-CRISPR protein AcrVA5, we demonstrated, based on computational simulations (e.g., deep learning and protein-protein docking analysis) and experimental verification, that the acetyl-inactivated Cas12a could be reversibly activated by HDAC-mediated deacetylation. By leveraging this switchable Cas12a for both target sensing and signal amplification, we established a sensitive one-pot assay capable of detecting the deacetylase sirtuin-1 with sub-nanomolar sensitivity, which is 50 times lower than that of the standard two-step peptide-based assay. The versatility of this assay was validated by the sensitive assessment of cellular HDAC activities in different cell lines with good accuracy, making it a valuable tool for biochemical studies and clinical diagnostics.
Affiliation(s)
- Wenyuan Kang
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
- Lin Liu
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
- Peihang Yu
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
- Tianyi Zhang
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
- Chunyang Lei
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
- Zhou Nie
  - State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University, Changsha, 410082, PR China
10
Malebary S, Rahman S, Barukab O, Ash’ari R, Khan SA. iAcety–SmRF: Identification of Acetylation Protein by Using Statistical Moments and Random Forest. Membranes 2022; 12:265. [PMID: 35323738 PMCID: PMC8955084 DOI: 10.3390/membranes12030265] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/10/2021] [Revised: 01/25/2022] [Accepted: 02/01/2022] [Indexed: 12/21/2022]
Abstract
Acetylation is the most important post-translational modification (PTM) in eukaryotes; it has manifold effects at the protein level and involves the transfer of an acetyl group from acetyl coenzyme A to a specific site on a polypeptide chain. Acetylation sites play many important roles, including regulating membrane protein functions and strongly affecting the membrane interaction of proteins and membrane remodeling. Because of these properties, their correct identification is essential to understanding the mechanism of acetylation in biological systems. Traditional methods, such as mass spectrometry and site-directed mutagenesis, are used for this purpose, but they are tedious and time-consuming. To overcome such limitations, many computer models are being developed to distinguish acetylated sequences from non-acetylated sequences, but they perform poorly in terms of accuracy, sensitivity, and specificity. This work proposes an efficient and accurate computational model for predicting acetylation using machine learning approaches. The proposed model achieved an accuracy of 100 percent in the 10-fold cross-validation test based on the Random Forest classifier, along with a feature extraction approach using statistical moments. The model was also validated by the jackknife, self-consistency, and independent tests, which achieved accuracies of 100, 100, and 97 percent, respectively, results far better than those of the existing models available in the literature.
Affiliation(s)
- Sharaf Malebary
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia
- Shaista Rahman
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan
  - Correspondence:
- Omar Barukab
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia
- Rehab Ash’ari
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia
- Sher Afzal Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan
11
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. Research 2022. [DOI: 10.34133/research.0011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi’an, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
12
Lv H, Zhang Y, Wang JS, Yuan SS, Sun ZJ, Dao FY, Guan ZX, Lin H, Deng KJ. iRice-MS: An integrated XGBoost model for detecting multitype post-translational modification sites in rice. Brief Bioinform 2021; 23:6447435. [PMID: 34864888 DOI: 10.1093/bib/bbab486] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/12/2021] [Revised: 10/05/2021] [Accepted: 10/23/2021] [Indexed: 12/13/2022] Open
Abstract
Post-translational modification (PTM) refers to the covalent and enzymatic modification of proteins after protein biosynthesis, which orchestrates a variety of biological processes. Detecting PTM sites at the proteome scale is one of the key steps toward an in-depth understanding of their regulatory mechanisms. In this study, we present an integrated method based on eXtreme Gradient Boosting (XGBoost), called iRice-MS, to identify 2-hydroxyisobutyrylation, crotonylation, malonylation, ubiquitination, succinylation and acetylation in rice. For each PTM-specific model, we adopted eight feature encoding schemes, including sequence-based features, physicochemical property-based features and spatial mapping information-based features. The optimal feature set was identified from each encoding, and the respective models were established. Extensive experimental results show that iRice-MS consistently displays excellent performance on 5-fold cross-validation and on the independent dataset test. In addition, our approach is superior to other existing tools in terms of AUC value. Based on the proposed model, a web server named iRice-MS was established and is freely accessible at http://lin-group.cn/server/iRice-MS.
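The 5-fold cross-validation protocol mentioned in this abstract can be sketched as an index-splitting routine: the samples are shuffled once, cut into five nearly equal folds, and each fold serves once as the validation set while the remaining four are used for training. The function name, the seed, and the sample count of 23 are assumptions for illustration; this is not the iRice-MS code.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Hypothetical data set of 23 samples split into 5 folds.
folds = list(k_fold_indices(23, k=5))
for train, validation in folds:
    print(len(train), len(validation))
```

An independent test set, as used in the abstract, would be held out before this split so that it never influences feature selection or model tuning.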
Affiliation(s)
- Hao Lv
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, China
- Jia-Shu Wang
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Shi-Shi Yuan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zi-Jie Sun
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Fu-Ying Dao
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zheng-Xing Guan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Ke-Jun Deng
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
13
|
Basith S, Lee G, Manavalan B. STALLION: a stacking-based ensemble learning framework for prokaryotic lysine acetylation site prediction. Brief Bioinform 2021; 23:6370848. [PMID: 34532736 PMCID: PMC8769686 DOI: 10.1093/bib/bbab376] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 06/04/2021] [Revised: 08/22/2021] [Accepted: 08/24/2021] [Indexed: 12/13/2022] Open
Abstract
Protein post-translational modification (PTM) is an important regulatory mechanism that plays a key role in both normal and disease states. Acetylation on lysine residues is one of the most potent PTMs owing to its critical role in cellular metabolism and regulatory processes. Identifying protein lysine acetylation (Kace) sites is a challenging task in bioinformatics. To date, several machine learning-based methods for the in silico identification of Kace sites have been developed. Of those, a few are prokaryotic species-specific. Despite their attractive advantages and performances, these methods have certain limitations. Therefore, this study proposes a novel predictor, STALLION (STacking-based Predictor for ProkAryotic Lysine AcetyLatION), containing six prokaryotic species-specific models to identify Kace sites accurately. To extract crucial patterns around Kace sites, we employed 11 different encodings representing three different characteristics. Subsequently, a systematic and rigorous feature selection approach was employed to identify the optimal feature set independently for five tree-based ensemble algorithms, and their respective baseline models were built for each species. Finally, the predicted values from the baseline models were used to train an appropriate classifier with the stacking strategy to develop STALLION. Comparative benchmarking experiments showed that STALLION significantly outperformed existing predictors on independent tests. To expedite direct accessibility to the STALLION models, a user-friendly online predictor was implemented, which is available at: http://thegleelab.org/STALLION.
Affiliation(s)
- Shaherin Basith: Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Gwang Lee: Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea
|
14
|
Islam MKB, Rahman J, Hasan MAM, Ahmad S. predForm-Site: Formylation site prediction by incorporating multiple features and resolving data imbalance. Comput Biol Chem 2021; 94:107553. [PMID: 34384997 DOI: 10.1016/j.compbiolchem.2021.107553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2020] [Revised: 06/22/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]
Abstract
Formylation is one of the newly discovered post-translational modifications of lysine residues and is responsible for different kinds of diseases. In this work, a novel predictor, named predForm-Site, has been developed to predict formylation sites with higher accuracy. We have integrated multiple sequence features to develop a more informative representation of formylation sites. Moreover, the decision function of the underlying classifier has been optimized on the skewed formylation dataset during prediction model training to improve prediction quality. On the dataset used by the LFPred and Formator predictors, predForm-Site achieved 99.5% sensitivity, 99.8% specificity and 99.8% overall accuracy with an AUC of 0.999 in the jackknife test. In the independent test, it also achieved more than 97% sensitivity and 99% specificity. Similarly, in benchmarking against the recent method CKSAAP_FormSite, the proposed predictor performed significantly better on all measures, particularly sensitivity by around 20%, specificity by nearly 30% and overall accuracy by more than 22%. These experimental results show that the proposed predForm-Site can be used as a complementary tool for the fast exploration of formylation sites. For the convenience of the scientific community, predForm-Site has been deployed as an online tool, accessible at http://103.99.176.239:8080/predForm-Site.
Affiliation(s)
- Md Khaled Ben Islam: Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Pabna University of Science and Technology, Pabna, Bangladesh
- Julia Rahman: Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Md Al Mehedi Hasan: Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Shamim Ahmad: Department of Computer Science & Engineering, Rajshahi University, Rajshahi, Bangladesh
|
15
|
Li A, Deng Y, Tan Y, Chen M. A Transfer Learning-Based Approach for Lysine Propionylation Prediction. Front Physiol 2021; 12:658633. [PMID: 33967828 PMCID: PMC8096918 DOI: 10.3389/fphys.2021.658633] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/27/2021] [Accepted: 03/15/2021] [Indexed: 12/12/2022] Open
Abstract
Lysine propionylation is a newly discovered posttranslational modification (PTM) that plays a key role in cellular processes. Although proteomics techniques were capable of detecting propionylation, large-scale detection was still challenging. To bridge this gap, we presented a transfer learning-based method for computationally predicting propionylation sites. The recurrent neural network-based deep learning model was first trained on malonylation data and then fine-tuned on propionylation data. The trained model served as a feature extractor in which input protein sequences were translated into numerical vectors. A support vector machine was used as the final classifier. The proposed method reached a Matthews correlation coefficient (MCC) of 0.6615 on the 10-fold cross-validation and 0.3174 on the independent test, outperforming state-of-the-art methods. The enrichment analysis indicated that propionylation was associated with these GO terms (GO:0016620, GO:0051287, GO:0003735, GO:0006096, and GO:0005737) and with metabolism. We developed a user-friendly online tool for predicting propionylation sites, which is available at http://47.113.117.61/.
Affiliation(s)
- Ang Li: School of Computer Science and Technology, Hunan Institute of Technology, Hengyang, China
- Yingwei Deng: School of Computer Science and Technology, Hunan Institute of Technology, Hengyang, China
- Yan Tan: School of Computer Science and Technology, Hunan Institute of Technology, Hengyang, China
- Min Chen: School of Computer Science and Technology, Hunan Institute of Technology, Hengyang, China
|
16
|
Aggarwal S, Tolani P, Gupta S, Yadav AK. Posttranslational modifications in systems biology. Advances in Protein Chemistry and Structural Biology 2021; 127:93-126. [PMID: 34340775 DOI: 10.1016/bs.apcsb.2021.03.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 11/28/2022]
Abstract
The biological complexity cannot be captured by genes or proteins alone. The protein posttranslational modifications (PTMs) impart functional diversity to the proteome and regulate protein structure, activity, localization and interactions. Their dynamics drive cellular signaling, growth and development while their dysregulation causes many diseases. Mass spectrometry based quantitative profiling of PTMs and bioinformatics analysis tools allow systems level insights into their network architecture. High-resolution profiling of PTM networks will advance disease understanding and precision medicine. It can accelerate the discovery of biomarkers and drug targets. This requires better tools for unbiased, high-throughput and accurate PTM identification, site localization and automated annotation on a systems level.
Affiliation(s)
- Suruchi Aggarwal: Translational Health Science and Technology Institute, NCR Biotech Science Cluster, Faridabad, Haryana, India; Department of Molecular Biology and Biotechnology, Cotton University, Guwahati, Assam, India
- Priya Tolani: Translational Health Science and Technology Institute, NCR Biotech Science Cluster, Faridabad, Haryana, India
- Srishti Gupta: Translational Health Science and Technology Institute, NCR Biotech Science Cluster, Faridabad, Haryana, India; School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
- Amit Kumar Yadav: Translational Health Science and Technology Institute, NCR Biotech Science Cluster, Faridabad, Haryana, India
|
17
|
Yang Y, Wang H, Li W, Wang X, Wei S, Liu Y, Xu Y. Prediction and analysis of multiple protein lysine modified sites based on conditional Wasserstein generative adversarial networks. BMC Bioinformatics 2021; 22:171. [PMID: 33789579 PMCID: PMC8010967 DOI: 10.1186/s12859-021-04101-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/28/2020] [Accepted: 03/23/2021] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Protein post-translational modification (PTM) is a key issue in investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins. METHOD We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by Pearson correlation coefficient. To solve the data imbalance problem, two influential deep generative methods, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), were leveraged and compared to generate new samples for the types with fewer samples. Finally, the random forest algorithm was utilized to predict seven categories. RESULTS In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. CONCLUSIONS The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM.
Affiliation(s)
- Yingxi Yang: Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Hui Wang: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, China
- Wen Li: Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Xiaobo Wang: Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Shizhao Wei: No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yulong Liu: No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yan Xu: Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
|
18
|
Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Brief Funct Genomics 2020; 20:1-18. [PMID: 33313647 DOI: 10.1093/bfgp/elaa023] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 10/25/2020] [Revised: 11/09/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022] Open
Abstract
Modifications of protein, RNA and DNA play an important role in many biological processes and are related to some diseases. Therefore, accurate identification and comprehensive understanding of protein, RNA and DNA modification sites can promote research on disease treatment and prevention. With the development of sequencing technology, the number of known sequences has continued to increase. In the past decade, many computational tools that can be used to predict protein, RNA and DNA modification sites have been developed. In this review, we comprehensively summarized the modification site predictors for three different biological sequences and the association with diseases. The relevant web server is accessible at http://lab.malab.cn/~acy/PTM_data/, and some sample data on protein, RNA and DNA modification can be downloaded from that website.
|
19
|
Xu H, Jia P, Zhao Z. Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning. Brief Bioinform 2020; 22:5856341. [PMID: 32578842 DOI: 10.1093/bib/bbaa099] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 02/24/2020] [Revised: 04/16/2020] [Accepted: 05/02/2020] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC) modification represents a novel epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and the stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next collected newly added 4mC sites in the six species' genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005-0.9722). In comparison, Deep4mC achieved an AUC value improvement of 10.14 to 46.21% over previous tools in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.
Affiliation(s)
- Haodong Xu: Center for Precision Health, School of Biomedical Informatics
- Peilin Jia: Center for Precision Health, School of Biomedical Informatics
- Zhongming Zhao: Center for Precision Health, School of Biomedical Informatics
|
20
|
Ning Q, Ma Z, Zhao X. dForml(KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou's 5-step rule and pseudo components. J Theor Biol 2019; 470:43-49. [DOI: 10.1016/j.jtbi.2019.03.011] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 02/27/2019] [Revised: 03/09/2019] [Accepted: 03/13/2019] [Indexed: 10/27/2022]
|
21
|
Chen G, Cao M, Yu J, Guo X, Shi S. Prediction and functional analysis of prokaryote lysine acetylation site by incorporating six types of features into Chou's general PseAAC. J Theor Biol 2019; 461:92-101. [DOI: 10.1016/j.jtbi.2018.10.047] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 08/16/2018] [Revised: 10/09/2018] [Accepted: 10/22/2018] [Indexed: 12/12/2022]
|
22
|
Cao M, Chen G, Yu J, Shi S. Computational prediction and analysis of species-specific fungi phosphorylation via feature optimization strategy. Brief Bioinform 2018; 21:595-608. [DOI: 10.1093/bib/bby122] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/02/2018] [Revised: 11/16/2018] [Accepted: 11/22/2018] [Indexed: 11/12/2022] Open
Abstract
Protein phosphorylation is a reversible and ubiquitous post-translational modification that primarily occurs at serine, threonine and tyrosine residues and regulates a variety of biological processes. In this paper, we first briefly summarized the current progress in computational prediction of eukaryotic protein phosphorylation sites, which has mainly focused on animals and plants, especially human, and to a lesser extent on fungi. Since the number of identified fungi phosphorylation sites has greatly increased in a wide variety of organisms and their roles in pathological physiology still remain largely unknown, more attention has been paid to the identification of fungi-specific phosphorylation. Here, experimental fungi phosphorylation site data were collected, and most of the sites were classified into different types to be encoded with various features and trained via a two-step feature optimization method. A novel method for the prediction of species-specific fungi phosphorylation, PreSSFP, was developed, which can identify fungi phosphorylation in seven species for specific serine, threonine and tyrosine residues (http://computbiol.ncu.edu.cn/PreSSFP). Meanwhile, we critically evaluated the performance of PreSSFP and compared it with other existing tools. The satisfying results showed that PreSSFP is a robust predictor. Feature analyses showed that there are some significant differences among the seven species. Species-specific prediction via the two-step feature optimization method, which mines important features for training, could considerably improve the prediction performance. We anticipate that our study provides a new lead for future computational analysis of fungi phosphorylation.
Affiliation(s)
- Man Cao: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Guodong Chen: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Jialin Yu: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Shaoping Shi: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
|
23
|
Yu J, Shi S, Zhang F, Chen G, Cao M. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics 2018; 35:2749-2756. [DOI: 10.1093/bioinformatics/bty1043] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 09/26/2018] [Revised: 12/13/2018] [Accepted: 12/20/2018] [Indexed: 01/22/2023] Open
Abstract
Motivation
Protein glycation is a familiar post-translational modification (PTM) that proceeds as a two-step non-enzymatic reaction. Glycation not only impairs the function of proteins but also changes their characteristics, so it is related to many human diseases. It is still difficult to systematically detect glycation sites because glycated residues lack crucial patterns. Computational approaches, which can filter supposed sites prior to experimental verification, can greatly increase the efficiency of experimental work. However, the previous lysine glycation prediction method uses a small number of training datasets. Hence, the model is not generalized or pervasive.
Results
By searching a new database, we collected a large dataset in Homo sapiens. PredGly, a novel software tool developed by combining multiple features, can predict lysine glycation sites for H. sapiens. In addition, XGboost was adopted to optimize feature vectors and to improve the model performance. In a comparison of various classifiers, the support vector machine achieved optimal performance. On a new independent test set, PredGly outperformed other glycation tools. This suggests that PredGly can provide more instructive guidance for further experimental research on lysine glycation.
Availability and implementation
https://github.com/yujialinncu/PredGly
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jialin Yu: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Shaoping Shi: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Fang Zhang: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Guodong Chen: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
- Man Cao: Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
|