1
|
Wang X, Ding Z, Wang R, Lin X. Deepro-Glu: combination of convolutional neural network and Bi-LSTM models using ProtBert and handcrafted features to identify lysine glutarylation sites. Brief Bioinform 2023; 24:6991122. [PMID: 36653898 DOI: 10.1093/bib/bbac631] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2022] [Revised: 12/11/2022] [Accepted: 12/28/2022] [Indexed: 01/20/2023] Open
Abstract
Lysine glutarylation (Kglu) is a newly discovered post-translational modification of proteins with important roles in mitochondrial functions, oxidative damage, etc. The established biological experimental methods to identify glutarylation sites are often time-consuming and costly. Therefore, there is an urgent need to develop computational methods for efficient and accurate identification of glutarylation sites. Most of the existing computational methods only utilize handcrafted features to construct the prediction model and do not consider the positive impact of the pre-trained protein language model on the prediction performance. Based on this, we develop an ensemble deep-learning predictor Deepro-Glu that combines convolutional neural network and bidirectional long short-term memory network using the deep learning features and traditional handcrafted features to predict lysine glutaryation sites. The deep learning features are generated from the pre-trained protein language model called ProtBert, and the handcrafted features consist of sequence-based features, physicochemical property-based features and evolution information-based features. Furthermore, the attention mechanism is used to efficiently integrate the deep learning features and the handcrafted features by learning the appropriate attention weights. 10-fold cross-validation and independent tests demonstrate that Deepro-Glu achieves competitive or superior performance than the state-of-the-art methods. The source codes and data are publicly available at https://github.com/xwanggroup/Deepro-Glu.
Collapse
Affiliation(s)
- Xiao Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Zhaoyuan Ding
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Rong Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Xi Lin
- Instiute of Artificial Intelligence, Xiamen University, No.4221, Xiang'an South Road, 361000, Xiamen, China
| |
Collapse
|
2
|
Chen H, He Y. Machine Learning Approaches in Traditional Chinese Medicine: A Systematic Review. THE AMERICAN JOURNAL OF CHINESE MEDICINE 2022; 50:91-131. [PMID: 34931589 DOI: 10.1142/s0192415x22500045] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Machine learning (ML), as a branch of artificial intelligence, acquires the potential and meaningful rules from the mass of data via diverse algorithms. Owing to all research of traditional Chinese medicine (TCM) belonging to the digitalization of clinical records or experimental works, a massive and complex amount of data has become an inextricable part of the related studies. It is thus not surprising that ML approaches, as novel and efficient tools to mine the useful knowledge from data, have created inroads in a diversity of scopes of TCM over the past decade of years. However, by browsing lots of literature, we find that not all of the ML approaches perform well in the same field. Upon further consideration, we infer that the specificity may inhere between the ML approaches and their applied fields. This systematic review focuses its attention on the four categories of ML approaches and their eight application scopes in TCM. According to the function, ML approaches are classified into four categories, including classification, regression, clustering, and dimensionality reduction, and into 14 models as follows in more detail: support vector machine, least square-support vector machine, logistic regression, partial least squares regression, k-means clustering, hierarchical cluster analysis, artificial neural network, back propagation neural network, convolutional neural network, decision tree, random forest, principal component analysis, partial least squares-discriminant analysis, and orthogonal partial least squares-discriminant analysis. The eight common applied fields are divided into two parts: one for TCM, such as the diagnosis of diseases, the determination of syndromes, and the analysis of prescription, and the other for the related researches of Chinese herbal medicine, such as the quality control, the identification of geographic origins, the pharmacodynamic material basis, the medicinal properties, and the pharmacokinetics and pharmacodynamics. Additionally, this paper discusses the function and feature difference among ML approaches when they are applied to the corresponding fields via comparing their principles. The specificity of each approach to its applied fields has also been affirmed, whereby laying a foundation for subsequent studies applying ML approaches to TCM.
Collapse
Affiliation(s)
- Haiyang Chen
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, P. R. China
| | - Yu He
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, P. R. China
| |
Collapse
|
3
|
Jiang Z, Wang D, Chen Y. Automatic classification of nerve discharge rhythms based on sparse auto-encoder and time series feature. BMC Bioinformatics 2022; 22:619. [PMID: 35168551 PMCID: PMC8848584 DOI: 10.1186/s12859-022-04592-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2022] [Accepted: 01/11/2022] [Indexed: 12/01/2022] Open
Abstract
Background Nerve discharge is the carrier of information transmission, which can reveal the basic rules of various nerve activities. Recognition of the nerve discharge rhythm is the key to correctly understand the dynamic behavior of the nervous system. The previous methods for the nerve discharge recognition almost depended on the traditional statistical features, and the nonlinear dynamical features of the discharge activity. The artificial extraction and the empirical judgment of the features were required for the recognition. Thus, these methods suffered from subjective factors and were not conducive to the identification of a large number of discharge rhythms. Results The ability of automatic feature extraction along with the development of the neural network has been greatly improved. In this paper, an effective discharge rhythm classification model based on sparse auto-encoder was proposed. The sparse auto-encoder was used to construct the feature learning network. The simulated discharge data from the Chay model and its variants were taken as the input of the network, and the fused features, including the network learning features, covariance and approximate entropy of nerve discharge, were classified by Softmax. The results showed that the accuracy of the classification on the testing data was 87.5%, which could provide more accurate classification results. Compared with other methods for the identification of nerve discharge types, this method could extract the characteristics of nerve discharge rhythm automatically without artificial design, and show a higher accuracy. Conclusions The sparse auto-encoder, even neural network has not been used to classify the basic nerve discharge from neither biological experiment data nor model simulation data. The automatic classification method of nerve discharge rhythm based on the sparse auto-encoder in this paper reduced the subjectivity and misjudgment of the artificial feature extraction, saved the time for the comparison with the traditional method, and improved the intelligence of the classification of discharge types. It could further help us to recognize and identify the nerve discharge activities in a new way. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04592-3.
Collapse
Affiliation(s)
- Zhongting Jiang
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
| | - Dong Wang
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China. .,Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China.
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China.,Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| |
Collapse
|
4
|
Zheng K, You ZH, Wang L, Li YR, Zhou JR, Zeng HT. MISSIM: An Incremental Learning-Based Model With Applications to the Prediction of miRNA-Disease Association. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1733-1742. [PMID: 32749964 DOI: 10.1109/tcbb.2020.3013837] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In the past few years, the prediction models have shown remarkable performance in most biological correlation prediction tasks. These tasks traditionally use a fixed dataset, and the model, once trained, is deployed as is. These models often encounter training issues such as sensitivity to hyperparameter tuning and "catastrophic forgetting" when adding new data. However, with the development of biomedicine and the accumulation of biological data, new predictive models are required to face the challenge of adapting to change. To this end, we propose a computational approach based on Broad learning system (BLS) to predict potential disease-associated miRNAs that retain the ability to distinguish prior training associations when new data need to be adapted. In particular, we are introducing incremental learning to the field of biological association prediction for the first time and proposed a new method for quantifying sequence similarity. In the performance evaluation, the AUC in the 5-fold cross-validation was 0.9400 +/- 0.0041. To better assess the effectiveness of MISSIM, we compared it with various classifiers and former prediction models. Its performance is superior to the previous method. Besides, the case study on identifying miRNAs associated with breast neoplasms, lung neoplasms and esophageal neoplasms show that 34, 36 and 35 out of the top 40 associations predicted by MISSIM are confirmed by recent biomedical resources. These results provide ample convincing evidence of this approach have potential value and prospect in promoting biomedical research productivity.
Collapse
|
5
|
PupStruct: Prediction of Pupylated Lysine Residues Using Structural Properties of Amino Acids. Genes (Basel) 2020; 11:genes11121431. [PMID: 33260770 PMCID: PMC7761138 DOI: 10.3390/genes11121431] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2020] [Revised: 11/23/2020] [Accepted: 11/23/2020] [Indexed: 12/23/2022] Open
Abstract
Post-translational modification (PTM) is a critical biological reaction which adds to the diversification of the proteome. With numerous known modifications being studied, pupylation has gained focus in the scientific community due to its significant role in regulating biological processes. The traditional experimental practice to detect pupylation sites proved to be expensive and requires a lot of time and resources. Thus, there have been many computational predictors developed to challenge this issue. However, performance is still limited. In this study, we propose another computational method, named PupStruct, which uses the structural information of amino acids with a radial basis kernel function Support Vector Machine (SVM) to predict pupylated lysine residues. We compared PupStruct with three state-of-the-art predictors from the literature where PupStruct has validated a significant improvement in performance over them with statistical metrics such as sensitivity (0.9234), specificity (0.9359), accuracy (0.9296), precision (0.9349), and Mathew’s correlation coefficient (0.8616) on a benchmark dataset.
Collapse
|
6
|
Jiang Z, Wang D, Wu P, Chen Y, Shang H, Wang L, Xie H. Predicting subcellular localization of multisite proteins using differently weighted multi-label k-nearest neighbors sets. Technol Health Care 2020; 27:185-193. [PMID: 31045538 PMCID: PMC6598103 DOI: 10.3233/thc-199018] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
BACKGROUND: For a protein to execute its function, ensuring its correct subcellular localization is essential. In addition to biological experiments, bioinformatics is widely used to predict and determine the subcellular localization of proteins. However, single-feature extraction methods cannot effectively handle the huge amount of data and multisite localization of proteins. Thus, we developed a pseudo amino acid composition (PseAAC) method and an entropy density technique to extract feature fusion information from subcellular multisite proteins. OBJECTIVE: Predicting multiplex protein subcellular localization and achieve high prediction accuracy. METHOD: To improve the efficiency of predicting multiplex protein subcellular localization, we used the multi-label k-nearest neighbors algorithm and assigned different weights to various attributes. The method was evaluated using several performance metrics with a dataset consisting of protein sequences with single-site and multisite subcellular localizations. RESULTS: Evaluation experiments showed that the proposed method significantly improves the optimal overall accuracy rate of multiplex protein subcellular localization. CONCLUSION: This method can help to more comprehensively predict protein subcellular localization toward better understanding protein function, thereby bridging the gap between theory and application toward improved identification and monitoring of drug targets.
Collapse
Affiliation(s)
- Zhongting Jiang
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
| | - Dong Wang
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China.,CAS Key Laboratory of Bio-Medical Diagnostics, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, Jiangsu, China.,Key Laboratory of Medicinal Plant and Animal Resources of Qinghai-Tibet Plateau in Qinghai Province, Qinghai Normal University, Xining, Qinghai, China
| | - Peng Wu
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
| | - Huijie Shang
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
| | - Luyao Wang
- School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
| | - Huichun Xie
- Key Laboratory of Medicinal Plant and Animal Resources of Qinghai-Tibet Plateau in Qinghai Province, Qinghai Normal University, Xining, Qinghai, China
| |
Collapse
|
7
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Although Deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they lack to consider the dependencies among nucleotides and the diverse binding lengths for different transcription factors (TFs). To address the above two limitations simultaneously, in this paper, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture the motif features of different length. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights about the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order in the motif discovery task.
Collapse
|
8
|
Singh V, Sharma A, Chandra A, Dehzangi A, Shigemizu D, Tsunoda T. Computational Prediction of Lysine Pupylation Sites in Prokaryotic Proteins Using Position Specific Scoring Matrix into Bigram for Feature Extraction. PRICAI 2019: TRENDS IN ARTIFICIAL INTELLIGENCE 2019. [DOI: 10.1007/978-3-030-29894-4_39] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
|
9
|
Bao W, Yang B, Li Z, Zhou Y. LAIPT: Lysine Acetylation Site Identification with Polynomial Tree. Int J Mol Sci 2018; 20:ijms20010113. [PMID: 30597947 PMCID: PMC6337602 DOI: 10.3390/ijms20010113] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2018] [Revised: 11/30/2018] [Accepted: 12/05/2018] [Indexed: 11/16/2022] Open
Abstract
Post-translational modification plays a key role in the field of biology. Experimental identification methods are time-consuming and expensive. Therefore, computational methods to deal with such issues overcome these shortcomings and limitations. In this article, we propose a lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate amino-acid residue relationships in peptide segments. This polynomial style was enriched by the physical and chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree. Finally, some effect evaluation measurements were employed to test the model’s performance.
Collapse
Affiliation(s)
- Wenzheng Bao
- School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou 221018, China.
| | - Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China.
| | - Zhengwei Li
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China.
| | - Yong Zhou
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China.
| |
Collapse
|
10
|
Cao M, Chen G, Yu J, Shi S. Computational prediction and analysis of species-specific fungi phosphorylation via feature optimization strategy. Brief Bioinform 2018; 21:595-608. [DOI: 10.1093/bib/bby122] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2018] [Revised: 11/16/2018] [Accepted: 11/22/2018] [Indexed: 11/12/2022] Open
Abstract
Abstract
Protein phosphorylation is a reversible and ubiquitous post-translational modification that primarily occurs at serine, threonine and tyrosine residues and regulates a variety of biological processes. In this paper, we first briefly summarized the current progresses in computational prediction of eukaryotic protein phosphorylation sites, which mainly focused on animals and plants, especially on human, with a less extent on fungi. Since the number of identified fungi phosphorylation sites has greatly increased in a wide variety of organisms and their roles in pathological physiology still remain largely unknown, more attention has been paid on the identification of fungi-specific phosphorylation. Here, experimental fungi phosphorylation sites data were collected and most of the sites were classified into different types to be encoded with various features and trained via a two-step feature optimization method. A novel method for prediction of species-specific fungi phosphorylation-PreSSFP was developed, which can identify fungi phosphorylation in seven species for specific serine, threonine and tyrosine residues (http://computbiol.ncu.edu.cn/PreSSFP). Meanwhile, we critically evaluated the performance of PreSSFP and compared it with other existing tools. The satisfying results showed that PreSSFP is a robust predictor. Feature analyses exhibited that there have some significant differences among seven species. The species-specific prediction via two-step feature optimization method to mine important features for training could considerably improve the prediction performance. We anticipate that our study provides a new lead for future computational analysis of fungi phosphorylation.
Collapse
Affiliation(s)
- Man Cao
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
| | - Guodong Chen
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
| | - Jialin Yu
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
| | - Shaoping Shi
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang, China
| |
Collapse
|
11
|
Yang B, Chen Y, Zhang W, Lv J, Bao W, Huang DS. HSCVFNT: Inference of Time-Delayed Gene Regulatory Network Based on Complex-Valued Flexible Neural Tree Model. Int J Mol Sci 2018; 19:E3178. [PMID: 30326663 PMCID: PMC6214043 DOI: 10.3390/ijms19103178] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2018] [Revised: 10/08/2018] [Accepted: 10/10/2018] [Indexed: 11/17/2022] Open
Abstract
Gene regulatory network (GRN) inference can understand the growth and development of animals and plants, and reveal the mystery of biology. Many computational approaches have been proposed to infer GRN. However, these inference approaches have hardly met the need of modeling, and the reducing redundancy methods based on individual information theory method have bad universality and stability. To overcome the limitations and shortcomings, this thesis proposes a novel algorithm, named HSCVFNT, to infer gene regulatory network with time-delayed regulations by utilizing a hybrid scoring method and complex-valued flexible neural network (CVFNT). The regulations of each target gene can be obtained by iteratively performing HSCVFNT. For each target gene, the HSCVFNT algorithm utilizes a novel scoring method based on time-delayed mutual information (TDMI), time-delayed maximum information coefficient (TDMIC) and time-delayed correlation coefficient (TDCC), to reduce the redundancy of regulatory relationships and obtain the candidate regulatory factor set. Then, the TDCC method is utilized to create time-delayed gene expression time-series matrix. Finally, a complex-valued flexible neural tree model is proposed to infer the time-delayed regulations of each target gene with the time-delayed time-series matrix. Three real time-series expression datasets from (Save Our Soul) SOS DNA repair system in E. coli and Saccharomyces cerevisiae are utilized to evaluate the performance of the HSCVFNT algorithm. As a result, HSCVFNT obtains outstanding F-scores of 0.923, 0.8 and 0.625 for SOS network and (In vivo Reverse-Engineering and Modeling Assessment) IRMA network inference, respectively, which are 5.5%, 14.3% and 72.2% higher than the best performance of other state-of-the-art GRN inference methods and time-delayed methods.
Collapse
Affiliation(s)
- Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China.
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan 250002, China.
| | - Wei Zhang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China.
| | - Jiaguo Lv
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China.
| | - Wenzheng Bao
- School of Computer Science, China University of Mining and Technology, Xuzhou 221000, China.
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, Tongji University, Shanghai 200092, China.
| |
Collapse
|