1
|
Ju Z, Zhang QB. iBhb-Lys: Identify lysine β-hydroxybutyrylation sites using autoencoder feature representation and fuzzy SVM algorithm. Anal Biochem 2025; 697:115715. [PMID: 39521356 DOI: 10.1016/j.ab.2024.115715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2024] [Revised: 11/05/2024] [Accepted: 11/06/2024] [Indexed: 11/16/2024]
Abstract
Lysine β-hydroxybutyrylation (Kbhb) is a newly discovered β-hydroxybutyrylate-derived histone modification that has been associated with the pathogenesis of many human diseases. To further elucidate the biological significance and molecular mechanism of Kbhb, it is necessary to accurately identify Kbhb sites from protein sequences. In this study, a novel computational model named iBhb-Lys is developed for the identification of Kbhb sites. Four types of features are combined to encode each Kbhb site as a 3266-dimensional feature vector. Because of the high dimensionality of the combined features, an autoencoder network is used to reduce the dimensionality of the feature space. In addition, to effectively reduce the influence of noise and outliers on classification, a new fuzzy support vector machine algorithm is proposed by incorporating the density around each sample into the fuzzy membership function. In an independent test, the AUC of iBhb-Lys is 2.22% higher than that of the existing predictor KbhbXG. Feature analysis shows that some amino acid composition features, such as the occurrence frequency of leucine and histidine residues around Kbhb sites, contribute strongly to the identification of Kbhb sites. The conclusions drawn in this study may provide a useful reference for studying the molecular mechanism of Kbhb.
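The idea of letting local density drive a fuzzy membership can be illustrated with a minimal sketch. The abstract does not give the membership function, so the formula below (mean distance to the k nearest neighbours, rescaled linearly into [floor, 1]) is an assumption made for illustration only, and the names `density_memberships`, `k`, and `floor` are hypothetical.

```python
import math


def density_memberships(points, k=2, floor=0.1):
    """Return one membership in [floor, 1] per point.

    Assumed rule (not taken from the paper): a point whose k nearest
    neighbours are far away sits in a sparse region and gets a low
    membership; points in dense regions get memberships close to 1.
    """
    mean_knn = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_knn.append(sum(nearest) / len(nearest))
    largest = max(mean_knn)
    if largest == 0:
        # All points coincide, so no point is sparser than another.
        return [1.0] * len(points)
    # Linear rescaling: the sparsest point maps to `floor`.
    return [floor + (1.0 - floor) * (1.0 - d / largest) for d in mean_knn]


# Four clustered points and one distant outlier.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
weights = density_memberships(samples, k=2)
```

Memberships of this kind are typically used as per-sample weights on the SVM penalty term, so that likely outliers influence the decision boundary less.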
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China.
| | - Qing-Bao Zhang
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China
|
2
|
Zuo Y, Zhang B, Dong Y, He W, Bi Y, Liu X, Zeng X, Deng Z. Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism. J Chem Inf Model 2024; 64:6699-6711. [PMID: 39121059 DOI: 10.1021/acs.jcim.4c01034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/11/2024]
Abstract
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness.
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
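Of the five feature types named above, amino acid composition (AAC) is the simplest to show. The sketch below is a generic AAC encoder, not code from the Glypred repository; the function name `aac`, the residue ordering, and the handling of non-standard characters are assumptions.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(window):
    """Encode a peptide window as 20 residue frequencies that sum to 1.

    Characters outside the 20 standard residues (for example padding
    symbols such as 'X' or '-') are ignored in this sketch.
    """
    residues = [r for r in window.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / total for a in AMINO_ACIDS]


# A short hypothetical window centred on a lysine (K).
features = aac("AGLKKTE")
```

Each window becomes a fixed-length vector regardless of sequence length, which is what allows AAC to be concatenated with the other encodings before feature selection.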
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Bangyi Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Yinkang Dong
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
| | - Yue Bi
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha 410012, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
|
3
|
Zhao Y, Jin J, Gao W, Qiao J, Wei L. Moss-m7G: A Motif-Based Interpretable Deep Learning Method for RNA N7-Methlguanosine Site Prediction. J Chem Inf Model 2024; 64:6230-6240. [PMID: 39011571 DOI: 10.1021/acs.jcim.4c00802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
N7-methylguanosine (m7G) modification plays a crucial role in various biological processes and is closely associated with the development and progression of many cancers. Accurate identification of m7G modification sites is essential for understanding their regulatory mechanisms and advancing cancer therapy. Previous studies often suffered from insufficient research data, underutilization of motif information, and lack of interpretability. In this work, we designed a novel motif-based interpretable method for m7G modification site prediction, called Moss-m7G. This approach enables the analysis of RNA sequences from a motif-centric perspective. Our proposed word-detection module and motif-embedding module within Moss-m7G extract motif information from sequences, transforming the raw sequences from base level to motif level and generating embeddings for these motif sequences. Compared with base sequences, motif sequences contain richer contextual information, which is further analyzed and integrated through the Transformer model. To address the data insufficiency noted in prior research, we constructed a comprehensive m7G data set for training and testing. Our experimental results affirm the effectiveness and superiority of Moss-m7G in predicting m7G modification sites. Moreover, the introduction of the word-detection module enhances the interpretability of the model, providing insights into the predictive mechanisms.
Affiliation(s)
- Yanxi Zhao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Wenjia Gao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- School of Informatics, Xiamen University, Xiamen 361104, China
|
4
|
Zhang H, Jiao J, Zhao T, Zhao E, Li L, Li G, Zhang B, Qin QM. GERWR: Identifying the Key Pathogenicity-Associated sRNAs of Magnaporthe Oryzae Infection in Rice Based on Graph Embedding and Random Walk With Restart. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:227-239. [PMID: 38153818 DOI: 10.1109/tcbb.2023.3348080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/30/2023]
Abstract
Rice blast, caused by Magnaporthe oryzae (M. oryzae), is a destructive rice disease that reduces rice yield by 10% to 30% annually. It also affects other cereal crops such as barley, wheat, rye, millet, sorghum, and maize. Small RNAs (sRNAs) play an essential regulatory role in fungus-plant interaction during the fungal invasion, but studies on pathogenic sRNAs during the fungal invasion of plants based on multi-omics data integration are rare. This paper proposes a novel approach called Graph Embedding combined with Random Walk with Restart (GERWR) to identify pathogenic sRNAs based on multi-omics data integration during M. oryzae invasion. By constructing a multi-omics network (MRMO), we identified 29 pathogenic sRNAs of rice blast fungus. Further analysis revealed that these sRNAs regulate rice genes in a many-to-many relationship, playing a significant regulatory role in the pathogenesis of rice blast disease. This paper explores the pathogenic factors of rice blast disease from the perspective of multi-omics data analysis, revealing the inherent connection between pathogenic factors of different omics. It has essential scientific significance for studying the pathogenic mechanism of rice blast fungus, the rice blast fungus-rice model system, and the pathogen-host interaction in related fields.
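Random walk with restart, one of the two components named in GERWR, can be sketched in a few lines. This is the textbook iteration p ← (1 − r)·W·p + r·p0 on a column-normalised adjacency matrix, not the authors' implementation; the restart probability, the tolerance, and the toy network are illustrative assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return a relevance score per node with respect to the seed nodes.

    adj   : symmetric adjacency matrix as a list of lists (0/1 or weights)
    seeds : indices of the nodes the walker restarts from
    """
    n = len(adj)
    # Column sums are used to turn the adjacency matrix into transition probabilities.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(adj[i][j] / col_sums[j] * p[j]
                       for j in range(n) if col_sums[j] > 0)
            nxt.append((1.0 - restart) * walk + restart * p0[i])
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:  # stop once the scores no longer change
            break
    return p


# Toy path network 0 - 1 - 2 - 3 with node 0 as the only seed.
network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(network, seeds=[0])
```

Nodes closer to the seeds receive higher scores, which is why the resulting ranking can be used to prioritise candidate nodes in a heterogeneous network.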
|
5
|
Fu X, Yuan Y, Qiu H, Suo H, Song Y, Li A, Zhang Y, Xiao C, Li Y, Dou L, Zhang Z, Cui F. AGF-PPIS: A protein-protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods 2024; 222:142-151. [PMID: 38242383 DOI: 10.1016/j.ymeth.2024.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/21/2023] [Revised: 01/04/2024] [Accepted: 01/13/2024] [Indexed: 01/21/2024] Open
Abstract
Protein-protein interactions play an important role in various biological processes, and knowledge of these interactions has a wide range of applications. Therefore, the correct identification of protein-protein interaction sites is crucial. In this paper, we propose a novel predictor for protein-protein interaction sites, AGF-PPIS, which uses a multi-head self-attention mechanism (introducing a graph structure), a graph convolutional network, and a feed-forward neural network. We use the Euclidean distance between protein residues to generate the corresponding protein graph as the input of AGF-PPIS. On the independent test dataset Test_60, AGF-PPIS achieves superior performance over comparative methods in terms of seven different evaluation metrics (ACC, precision, recall, F1-score, MCC, AUROC, AUPRC), which demonstrates the validity of the proposed AGF-PPIS model. The source code and usage instructions for AGF-PPIS are available at https://github.com/fxh1001/AGF-PPIS.
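Building a residue graph from pairwise Euclidean distances can be sketched as follows. The abstract does not state which atoms or which distance threshold AGF-PPIS uses, so the 8.0 cutoff (in ångströms) and the function name `contact_graph` are illustrative assumptions rather than the published settings.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Return a symmetric 0/1 adjacency matrix over residues.

    coords : one (x, y, z) tuple per residue, e.g. a representative atom
    cutoff : residues closer than or equal to this distance are connected
             (assumed value, not taken from the paper)
    """
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj


# Four hypothetical residues on a line; the last one is far from the rest.
residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
graph = contact_graph(residues)
```

An adjacency matrix of this form is the usual input to a graph convolutional layer, alongside a per-residue feature matrix.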
Affiliation(s)
- Xiuhao Fu
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Ye Yuan
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Haoye Qiu
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Haodong Suo
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yingying Song
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Anqi Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yupeng Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Cuilin Xiao
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yazi Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China.
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China.
|
6
|
Zhang W, Liu B. iSnoDi-MDRF: Identifying snoRNA-Disease Associations Based on Multiple Biological Data by Ranking Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3013-3019. [PMID: 37030816 DOI: 10.1109/tcbb.2023.3258448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Accumulating evidence indicates that the dysregulation of small nucleolar RNAs (snoRNAs) is associated with diseases. Identifying snoRNA-disease associations by computational methods is desirable for biologists, because it can save considerable cost and time compared with biological experiments. However, it still faces the following challenges: (i) Many snoRNAs have been detected in recent years, but only a few have been proved to be associated with diseases; (ii) Computational predictors trained with only a few known snoRNA-disease associations fail to accurately identify snoRNA-disease associations. In this study, we propose a ranking framework, called iSnoDi-MDRF, to identify potential snoRNA-disease associations based on multiple biological data, which has the following highlights: (i) iSnoDi-MDRF integrates a ranking framework, which can identify not only potential associations between known snoRNAs and diseases but also diseases associated with new snoRNAs. (ii) Known gene-disease associations are employed to help train a mature model for predicting snoRNA-disease associations. Experimental results illustrate that iSnoDi-MDRF is well suited to identifying potential snoRNA-disease associations. The web server of the iSnoDi-MDRF predictor is freely available at http://bliulab.net/iSnoDi-MDRF/.
|
7
|
Wei MM, Yu CQ, Li LP, You ZH, Wang L. BCMCMI: A Fusion Model for Predicting circRNA-miRNA Interactions Combining Semantic and Meta-path. J Chem Inf Model 2023; 63:5384-5394. [PMID: 37535872 DOI: 10.1021/acs.jcim.3c00852] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 08/05/2023]
Abstract
Growing evidence suggests that circRNA plays a vital role in the development and treatment of diseases by interacting with miRNA. Therefore, accurate prediction of potential circRNA-miRNA interactions (CMIs) has become urgent. However, traditional wet experiments are time-consuming and costly, and their results can be affected by objective factors. In this paper, we propose a computational model, BCMCMI, which combines three features to predict CMIs. Specifically, BCMCMI utilizes the bidirectional encoding capability of the BERT algorithm to extract sequence features from the semantic information of circRNA and miRNA. Then, a heterogeneous network is constructed based on cosine similarity and known CMI information. Metapath2vec is employed to conduct random walks following meta-paths in the network to capture topological features, including similarity features. Finally, potential CMIs are predicted using the XGBoost classifier. BCMCMI achieves superior results compared to other state-of-the-art models on two benchmark datasets for CMI prediction. We also utilize t-SNE to visually observe the distribution of the extracted features on a randomly selected dataset. The prediction results show that BCMCMI can serve as a valuable complement to the wet experiment process.
Affiliation(s)
- Meng-Meng Wei
- School of Information Engineering, Xijing University, Xi'an, Shaanxi 710123, China
| | - Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi'an, Shaanxi 710123, China
| | - Li-Ping Li
- College of Agriculture and Forestry, Longdong University, Qingyang, Gansu 745000, China
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China
| | - Lei Wang
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
|
8
|
Zhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics (Basel) 2023; 13:2465. [PMID: 37510209 PMCID: PMC10377868 DOI: 10.3390/diagnostics13142465] [Citation(s) in RCA: 35] [Impact Index Per Article: 35.0] [Received: 06/10/2023] [Revised: 07/13/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023] Open
Abstract
Heparin-binding protein (HBP) is a cationic antibacterial protein derived from multinuclear neutrophils and an important biomarker of infectious diseases. The correct identification of HBP is of great significance to the study of infectious diseases. This work provides the first HBP recognition framework based on machine learning to accurately identify HBP. By using four sequence descriptors, HBP and non-HBP samples were represented by discrete numbers. By inputting these features into a support vector machine (SVM) and random forest (RF) algorithm and comparing the prediction performances of these methods on training data and independent test data, it is found that the SVM-based classifier has the greatest potential to identify HBP. The model could produce an auROC of 0.981 ± 0.028 on training data using 10-fold cross-validation and an overall accuracy of 95.0% on independent test data. As the first model for HBP recognition, it will provide some help for infectious diseases and stimulate further research in related fields.
Affiliation(s)
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou 571158, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou 571158, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
| | - Shi-Shi Yuan
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Jian Li
- School of Basic Medical Sciences, Chengdu University, Chengdu 610106, China
| | - Cheng-Bing Huang
- School of Computer Science and Technology, ABa Teachers University, Chengdu 623002, China
| | - Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou 571158, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou 571158, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
|
9
|
Luo X, Wang Y, Zou Q, Xu L. Recall DNA methylation levels at low coverage sites using a CNN model in WGBS. PLoS Comput Biol 2023; 19:e1011205. [PMID: 37315069 DOI: 10.1371/journal.pcbi.1011205] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/30/2022] [Accepted: 05/22/2023] [Indexed: 06/16/2023] Open
Abstract
DNA methylation is an important regulator of gene transcription. WGBS is the gold-standard approach for base-pair-resolution quantification of DNA methylation, but it requires high sequencing depth. Many CpG sites have insufficient coverage in WGBS data, resulting in inaccurate DNA methylation levels at individual sites. Many state-of-the-art computational methods have been proposed to predict the missing values. However, many of them require either other omics datasets or cross-sample data, and most predict only the state of DNA methylation. In this study, we propose RcWGBS, which can impute the missing (or low-coverage) values from the DNA methylation levels at adjacent sites. Deep learning techniques were employed for accurate prediction. The WGBS datasets of H1-hESC and GM12878 were down-sampled. The average differences between the DNA methylation level at 12× depth predicted by RcWGBS and that at >50× depth in H1-hESC and GM12878 cells are less than 0.03 and 0.01, respectively. RcWGBS performed better than METHimpute even when the sequencing depth was as low as 12×. Our work helps to process methylation data of low sequencing depth, allowing researchers to save sequencing costs and improve data utilization through computational methods.
Affiliation(s)
- Ximei Luo
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Yansu Wang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
|
10
|
Zhang L, Bai T, Wu H. sgRNA-2wPSM: Identify sgRNAs on-target activity by combining two-window-based position specific mismatch and synthetic minority oversampling technique. Comput Biol Med 2023; 155:106489. [PMID: 36841059 DOI: 10.1016/j.compbiomed.2022.106489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Accepted: 12/27/2022] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Prediction of sgRNA on-target activity is a critical step in the CRISPR-Cas9 system. Because of its importance to RNA function research and genome editing applications, several computational methods have been introduced, treating it as either a binary classification task or a regression task. Among these methods, sgRNA-PSM is a state-of-the-art method. In this work, we improved this method by proposing a new feature extraction method called two-window-based PSM, which divides the DNA sequences into two non-overlapping segments so as to extract different patterns from the two segments. The two-window-based PSM features were fed into Support Vector Machines (SVMs), yielding a new method called sgRNA-2wPSM. Furthermore, a new oversampling method called SCORE-SVM-SMOTE, based on the SVM-SMOTE algorithm, was proposed to solve the imbalanced training set problem. Results on the benchmark datasets indicated that sgRNA-2wPSM is superior to other methods.
Affiliation(s)
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China.
| | - Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China.
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
|
11
|
Yan TC, Yue ZX, Xu HQ, Liu YH, Hong YF, Chen GX, Tao L, Xie T. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction. Comput Biol Med 2023; 154:106446. [PMID: 36680931 DOI: 10.1016/j.compbiomed.2022.106446] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/22/2022] [Revised: 12/07/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
New drug discovery is inseparable from the discovery of drug targets, and the vast majority of the known targets are proteins. At the same time, proteins are essential structural and functional elements of living cells necessary for the maintenance of all forms of life. Therefore, protein functions have become the focus of many pharmacological and biological studies. Traditional experimental techniques are no longer adequate for rapidly growing annotation of protein sequences, and approaches to protein function prediction using computational methods have emerged and flourished. A significant trend has been to use machine learning to achieve this goal. In this review, approaches to protein function prediction based on the sequence, structure, protein-protein interaction (PPI) networks, and fusion of multi-information sources are discussed. The current status of research on protein function prediction using machine learning is considered, and existing challenges and prominent breakthroughs are discussed to provide ideas and methods for future studies.
Affiliation(s)
- Tian-Ci Yan
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Zi-Xuan Yue
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
| | - Tian Xie
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
|
12
|
Zhu D, Yang W, Xu D, Li H, Zhao Y, Li D. A deep learning based two-layer predictor to identify enhancers and their strength. Methods 2023; 211:23-30. [PMID: 36740001 DOI: 10.1016/j.ymeth.2023.01.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/19/2022] [Revised: 01/03/2023] [Accepted: 01/30/2023] [Indexed: 02/05/2023] Open
Abstract
An enhancer is a DNA sequence that can increase the activity of promoters and thus raise the frequency of gene transcription. Enhancers play an essential role in activating gene expression. Gene sequencing technology has developed over 30 years from the first generation to the third generation, and the volume of biological sequence data increases significantly every year. Enhancer functions are important, yet identifying enhancers through biochemical experiments is very expensive. Therefore, new methods are needed for the identification and classification of enhancers. Based on the K-mer principle, this study proposes a feature extraction method that has not previously been used with convolutional neural networks. We then combined it with one-hot encoding to build an efficient one-dimensional convolutional neural network ensemble model for predicting enhancers and their strengths. Finally, we used five common classification evaluation metrics to compare our model with models proposed by other researchers. On the same independent test dataset, the model proposed in this paper performs better than the other models.
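The two encodings mentioned here, one-hot and K-mer based, can be illustrated generically. The abstract does not describe the paper's specific K-mer feature, so the sketch below shows only the standard forms (a per-base indicator vector and normalised k-mer frequencies); the function names and the choice k=2 are assumptions.

```python
from itertools import product

BASES = "ACGT"


def one_hot(seq):
    """Map each base to a 4-element indicator row; unknown bases become all zeros."""
    return [[1 if b == base else 0 for base in BASES] for b in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Return the frequency of every possible k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


# A short hypothetical DNA fragment.
matrix = one_hot("ACGTN")
freqs = kmer_frequencies("AAAC", k=2)
```

One-hot encoding preserves position and yields a length-by-4 matrix suited to one-dimensional convolutions, whereas k-mer frequencies summarise local composition in a fixed-length vector of 4^k values.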
Affiliation(s)
- Di Zhu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, Shenzhen, China
| | - Dali Xu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
| | - Dan Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
|
13
|
Wang H, Zhang Z, Li H, Li J, Li H, Liu M, Liang P, Xi Q, Xing Y, Yang L, Zuo Y. A cost-effective machine learning-based method for preeclampsia risk assessment and driver genes discovery. Cell Biosci 2023; 13:41. [PMID: 36849879 PMCID: PMC9972636 DOI: 10.1186/s13578-023-00991-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 11/09/2022] [Accepted: 02/15/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND: The placenta, as a unique exchange organ between mother and fetus, is essential for successful human pregnancy and fetal health. Preeclampsia (PE) caused by placental dysfunction contributes to both maternal and infant morbidity and mortality. Accurate identification of PE patients plays a vital role in the formulation of treatment plans. However, traditional clinical methods for diagnosing PE have a high misdiagnosis rate. RESULTS: Here, we first designed a computational biology method that uses single-cell transcriptome (scRNA-seq) data from healthy pregnancy (38 wk) and early-onset PE (28-32 wk) to identify pathological cell subpopulations and predict PE risk. Based on machine learning methods and feature selection techniques, we observed that the Tuning ReliefF (TURF) score combined with XGBoost (TURF_XGB) achieved optimal performance, with 92.61% accuracy and 92.46% recall for classifying nine cell subpopulations of healthy placentas. Biological landscapes of placenta heterogeneity could be mapped by the 110 marker genes screened by TURF_XGB, which revealed the superiority of the TURF feature mining. Moreover, we processed the PE dataset with LASSO to obtain 497 biomarkers. Integration analysis of the above two gene sets revealed that dendritic cells were closely associated with early-onset PE, and C1QB and C1QC might drive preeclampsia by mediating inflammation. In addition, an ensemble model-based risk stratification card was developed to classify preeclampsia patients, and its area under the receiver operating characteristic curve (AUC) could reach 0.99. For broader accessibility, we designed an online web server ( http://bioinfor.imu.edu.cn/placenta ). CONCLUSION: Single-cell transcriptome-based preeclampsia risk assessment using an ensemble machine learning framework is a valuable asset for clinical decision-making. C1QB and C1QC may be involved in the development and progression of early-onset PE by affecting the complement and coagulation cascade pathways that mediate inflammation, which has important implications for better understanding the pathogenesis of PE.
Affiliation(s)
- Hao Wang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd., Hohhot, 010010, China
| | - Zhaoyue Zhang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Haicheng Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd., Hohhot, 010010, China
| | - Jinzhao Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Hanshuang Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Mingzhu Liu
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd., Hohhot, 010010, China
| | - Pengfei Liang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Qilemuge Xi
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Yongqiang Xing
- School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China.
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China.
| | - Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd., Hohhot, 010010, China.
| |
|
14
|
Shi H, Wu C, Bai T, Chen J, Li Y, Wu H. Identify essential genes based on clustering based synthetic minority oversampling technique. Comput Biol Med 2023; 153:106523. [PMID: 36652869 DOI: 10.1016/j.compbiomed.2022.106523] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/22/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/03/2023]
Abstract
Prediction of essential genes in a living organism is one of the central tasks in synthetic biology. Computational predictors are desired because experimental data is often unavailable. Recently, some sequence-based predictors have been constructed to identify essential genes. However, their predictive performance should be further improved. One key problem is how to effectively extract sequence-based features that are able to discriminate the essential genes. Another problem is the imbalanced training set. The number of essential genes in human cell lines is lower than that of non-essential genes. Therefore, predictors trained with such an imbalanced training set tend to identify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) was proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE was proposed to identify essential genes. The rigorous jackknife cross validation results indicated that iEsGene-CSMOTE is better than the other competing methods. The proposed method outperformed λ-interval Z curve by 35.48% and 11.25% in terms of Sn and BACC, respectively.
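The over-sampling idea in this abstract can be illustrated with a minimal sketch of plain SMOTE-style interpolation. This is not the authors' CSMOTE: the clustering step is omitted, and the function name `smote_like`, the parameter `k`, and the toy data are assumptions made purely for illustration.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours.

    Plain SMOTE-style sketch; the clustering step of CSMOTE is NOT included.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical toy minority class with two features per sample.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)
```

Each synthetic point lies on the segment between two existing minority samples, so the minority class grows without duplicating records exactly.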
Affiliation(s)
- Hua Shi
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Chenjin Wu
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China.
| | - Jiahai Chen
- Xiamen Sankuai Online Technology Co., Ltd, Xiamen, China.
| | - Yan Li
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| |
|
15
|
Cheng N, Liu J, Chen C, Zheng T, Li C, Huang J. Prediction of lung cancer metastasis by gene expression. Comput Biol Med 2023; 153:106490. [PMID: 36638618 DOI: 10.1016/j.compbiomed.2022.106490] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/18/2022] [Revised: 12/14/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022]
Abstract
Tumor metastasis is the main cause of death in cancer patients. Early prediction of tumor metastasis can allow for timely intervention. At present, research on tumor metastasis mainly focuses on manual diagnosis by imaging or diagnosis by computational methods. As the tumor deteriorates, gene expression levels in blood change greatly. It is therefore feasible to measure the transcripts of key genes to predict whether cancer will metastasize. In this paper, we obtained gene expression data for 226 patients from TCGA. These data included 239,322 transcripts. Background screening and LASSO analysis were used to select 31 transcripts as features. Finally, a deep neural network (DNN) was used to determine whether or not lung cancer would metastasize. We compared our method with several other methods and found that it achieved the best precision. In addition, in a previous study, we identified 7 genes that play a vital role in lung cancer. We added those gene transcripts into the DNN and found that the AUC and AUPR of the model were increased.
Affiliation(s)
- Nitao Cheng
- Department of Thoracic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Junliang Liu
- Faculty of Computing, Harbin Institute of Technology, Harbin, China
| | - Chen Chen
- Department of Biological Repositories, Zhongnan Hospital of Wuhan University, China
| | - Tang Zheng
- Department of Thoracic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Changsheng Li
- Department of Thoracic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Jingyu Huang
- Department of Thoracic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China.
| |
|
16
|
Zhang YF, Wang YH, Gu ZF, Pan XR, Li J, Ding H, Zhang Y, Deng KJ. Bitter-RF: A random forest machine model for recognizing bitter peptides. Front Med (Lausanne) 2023; 10:1052923. [PMID: 36778738 PMCID: PMC9909039 DOI: 10.3389/fmed.2023.1052923] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 09/24/2022] [Accepted: 01/05/2023] [Indexed: 01/27/2023] Open
Abstract
Introduction: Bitter peptides are short peptides with potential medical applications. The huge potential behind their bitter taste remains to be tapped. To better explore the value of bitter peptides in practice, we need a more effective classification method for identifying bitter peptides. Methods: In this study, we developed a random forest (RF)-based model, called Bitter-RF, using sequence information of the bitter peptide. Bitter-RF covers more comprehensive and extensive information by integrating 10 features extracted from the bitter peptides and achieves better results than the latest generation model on an independent validation set. Results: The proposed model can improve the accurate classification of bitter peptides (AUROC = 0.98 on the independent test set) and extends the practical application of the RF method, which had not previously been used to build a prediction model for bitter peptides, in protein classification tasks. Discussion: We hope that Bitter-RF can provide more convenience to scholars in bitter peptide research.
Affiliation(s)
- Yu-Fei Zhang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yu-Hao Wang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhi-Feng Gu
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Xian-Run Pan
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Jian Li
- School of Basic Medical Sciences, Chengdu University, Chengdu, China
| | - Hui Ding
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Ke-Jun Deng
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
17
|
Liang Q, Zhang W, Wu H, Liu B. LncRNA-disease association identification using graph auto-encoder and learning to rank. Brief Bioinform 2023; 24:6955271. [PMID: 36545805 DOI: 10.1093/bib/bbac539] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/20/2022] [Revised: 10/18/2022] [Accepted: 11/08/2022] [Indexed: 12/24/2022] Open
Abstract
Discovering the relationships between long non-coding RNAs (lncRNAs) and diseases is significant in the treatment, diagnosis and prevention of diseases. However, the currently identified lncRNA-disease associations are insufficient because wet-laboratory experiments are expensive and labor-intensive. Therefore, it is important to develop an efficient computational method for predicting potential lncRNA-disease associations. Previous methods showed that combining the lncRNA-disease associations predicted by different classification methods via a Learning to Rank (LTR) algorithm can be effective for predicting potential lncRNA-disease associations. However, when the classification results are incorrect, the ranking results will inevitably be affected. We propose the GraLTR-LDA predictor, based on biological knowledge graphs and a ranking framework, for predicting potential lncRNA-disease associations. First, a homogeneous graph and a heterogeneous graph are constructed by integrating multi-source biological information. Then, GraLTR-LDA integrates a graph auto-encoder and an attention mechanism to extract embedded features from the constructed graphs. Finally, GraLTR-LDA incorporates the embedded features into the LTR via feature crossing statistical strategies to predict the priority order of diseases associated with query lncRNAs. Experimental results demonstrate that GraLTR-LDA outperforms the other state-of-the-art predictors and can effectively detect potential lncRNA-disease associations. Availability and implementation: Datasets and source codes are available at http://bliulab.net/GraLTR-LDA.
Affiliation(s)
- Qi Liang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Wenxiang Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
|
18
|
Su W, Deng S, Gu Z, Yang K, Ding H, Chen H, Zhang Z. Prediction of apoptosis protein subcellular location based on amphiphilic pseudo amino acid composition. Front Genet 2023; 14:1157021. [PMID: 36926588 PMCID: PMC10011625 DOI: 10.3389/fgene.2023.1157021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 02/20/2023] [Indexed: 03/08/2023] Open
Abstract
Introduction: Apoptosis proteins play an important role in the process of cell apoptosis, which keeps the rates of cell proliferation and death in relative balance. Because the function of an apoptosis protein is closely related to its subcellular location, it is of great significance to study the subcellular locations of apoptosis proteins. Many efforts in bioinformatics research have been aimed at predicting their subcellular location. However, the subcellular localization of apoptotic proteins needs to be carefully studied. Methods: In this paper, based on amphiphilic pseudo amino acid composition and the support vector machine algorithm, a new method was proposed for the prediction of apoptosis proteins' subcellular location. Results and Discussion: The method achieved good performance on three data sets. The jackknife test accuracy on the three data sets reached 90.5%, 93.9% and 84.0%, respectively. Compared with previous methods, the prediction accuracies of APACC_SVM were improved.
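The jackknife test reported here is leave-one-out cross-validation: each sample is predicted by a model built from all remaining samples. The sketch below illustrates only that evaluation loop. It assumes a 1-nearest-neighbour classifier as a stand-in for the paper's support vector machine, and the feature vectors, labels, and function names are hypothetical.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Predict the label of `query` as the label of its closest training sample."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def jackknife_accuracy(features, labels):
    """Leave-one-out accuracy: every sample is held out once and predicted
    from all the other samples."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbour_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical 2-D feature vectors for proteins from two subcellular locations.
features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
            [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]
labels = ["cytoplasm", "cytoplasm", "cytoplasm",
          "nucleus", "nucleus", "nucleus"]
accuracy = jackknife_accuracy(features, labels)
```

Because no random split is involved, the jackknife estimate is deterministic for a given data set, which is one reason it is often reported for small benchmark sets.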
Affiliation(s)
- Wenxia Su
- College of Science, Inner Mongolia Agriculture University, Hohhot, China
| | - Shuyi Deng
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhifeng Gu
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Keli Yang
- Nonlinear Research Institute, Baoji University of Arts and Sciences, Baoji, China
| | - Hui Ding
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hui Chen
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Zhaoyue Zhang
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| |
|
19
|
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
20
|
Zhang T, Lin Y, He W, Yuan F, Zeng Y, Zhang S. GCN-GENE: A novel method for prediction of coronary heart disease-related genes. Comput Biol Med 2022; 150:105918. [PMID: 36215847 DOI: 10.1016/j.compbiomed.2022.105918] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/13/2022] [Revised: 07/19/2022] [Accepted: 07/30/2022] [Indexed: 11/22/2022]
Abstract
Coronary heart disease is the most common heart disease. It can induce myocardial infarction, and its cause is closely related to lifestyle and eating habits. The results of a large number of epidemiological studies at home and abroad show that the incidence of coronary heart disease has an obvious familial tendency. However, little is known about the genetic factors of coronary heart disease. Although genome-wide association analysis and gene knockout experiments have found some genes related to coronary heart disease, a large number of potentially related genes remain undiscovered. Confirming them by biological experiments costs too much time and money. Therefore, it is urgent to identify genes related to coronary heart disease on a large scale by computational means, so that targeted biological experimental verification can follow. This paper proposes a deep learning method based on biological networks for the identification of coronary heart disease-related genes. We constructed gene interaction networks and extracted gene expression levels in different tissues as features. Using the association information and expression characteristics of genes, we constructed a model of coronary heart disease-related genes. Through cross-validation, we found that our proposed GCN-GENE achieves an AUC of 0.75 and an AUPR of 0.78, which is more accurate than other methods, making it a reliable method for predicting coronary heart disease-related genes.
Affiliation(s)
- Tong Zhang
- Department of Cardiology, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Guangdong, China.
| | - Yixuan Lin
- Department of Cardiology, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Guangdong, China.
| | - Weimin He
- Department of Cardiology, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Guangdong, China.
| | - FengXin Yuan
- Department of Cardiology, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Guangdong, China.
| | - Yu Zeng
- Department of Cardiology, The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Guangdong, China.
| | - Shihua Zhang
- College of Life Science and Health, Wuhan University of Science and Technology, Wuhan, China.
| |
|
21
|
Su Q, Wang F, Chen D, Chen G, Li C, Wei L. Deep convolutional neural networks with ensemble learning and transfer learning for automated detection of gastrointestinal diseases. Comput Biol Med 2022; 150:106054. [PMID: 36244302 DOI: 10.1016/j.compbiomed.2022.106054] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 07/25/2022] [Revised: 08/12/2022] [Accepted: 08/27/2022] [Indexed: 11/22/2022]
Abstract
Gastrointestinal (GI) diseases are serious threats to human health, and their detection and treatment place a huge burden on medical institutions. Imaging-based methods are among the most important approaches for automated detection of gastrointestinal diseases. Although deep neural networks have shown impressive performance in a number of imaging tasks, their application to the detection of gastrointestinal diseases has not been sufficiently explored. In this study, we propose a novel and practical method to detect gastrointestinal disease from wireless capsule endoscopy (WCE) images by convolutional neural networks. The proposed method utilizes three backbone networks, modified and fine-tuned by transfer learning, as the feature extractors, and an integrated classifier using ensemble learning is trained to detect gastrointestinal diseases. The proposed method outperforms existing computational methods on the benchmark dataset. The case study results show that the proposed method captures discriminative information of wireless capsule endoscopy images. This work shows the potential of using deep learning-based computer vision models for effective GI disease screening.
Affiliation(s)
- Qiaosen Su
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Fengsheng Wang
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | | | | | - Chao Li
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China.
| |
|
22
|
Xiao J, Liu M, Huang Q, Sun Z, Ning L, Duan J, Zhu S, Huang J, Lin H, Yang H. Analysis and modeling of myopia-related factors based on questionnaire survey. Comput Biol Med 2022; 150:106162. [PMID: 36252365 DOI: 10.1016/j.compbiomed.2022.106162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/12/2022] [Accepted: 10/01/2022] [Indexed: 11/03/2022]
Abstract
With the rapid development of science and technology, the trend toward myopia at younger ages is becoming increasingly significant. The latest national survey done by the Chinese government found that more than 80% of Chinese teenagers suffer from myopia. Adolescent myopia is closely related to living environment, heredity, and living habits. Quantifying the relationship between myopia and these factors is conducive to the prevention and intervention of adolescent myopia. In this study, we investigated the relationships between four main factors (environment, habits, parental vision, and demographics) and myopia status by analyzing questionnaire data. Data were collected in Chengdu, China in 2021, including 2808 myopia samples and 5693 non-myopia samples, with a total of 22 features. These 22 features were then input into three machine learning algorithms to discriminate the two classes of samples. Results show that the computational model could produce an AUC of 0.768. To pick out the features that matter most for classification, we used an incremental feature selection strategy to screen the 22 features. As a result, we found that the 4 most influential features with XGBoost could achieve a competitive AUC of 0.764. To further investigate the risk and protective factors affecting adolescent myopia, we used OR values derived from MLE-LR to analyze the relationship between the 22 features and adolescent myopia. Results showed that the age variable was the most significant risk factor for myopia, followed by the myopia status of parents. The most protective factor for eyesight is the measure taken by the children, followed by the distance between books and eyes when reading. These discoveries can guide the prevention and control of myopia in children and adolescents.
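As a rough illustration of how an odds ratio (OR) quantifies a risk factor, the sketch below computes the OR and an approximate 95% confidence interval (Woolf method) from a 2x2 table. It is an assumption-laden simplification: the study derives its ORs from maximum-likelihood logistic regression over 22 features, whereas this example handles a single binary factor, for which the logistic-regression OR coincides with the cross-product ratio. The counts and the function name are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a: myopic, factor present      b: non-myopic, factor present
    c: myopic, factor absent       d: non-myopic, factor absent
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR), Woolf's approximation
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, e.g. "at least one myopic parent" vs. myopia status.
or_value, ci_low, ci_high = odds_ratio_with_ci(a=120, b=80, c=60, d=140)
```

An OR above 1 whose interval excludes 1 suggests a risk factor, while an OR below 1 suggests a protective factor.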
Affiliation(s)
- Jianqiang Xiao
- Eye School, Chengdu University of Traditional Chinese Medicine, Ineye Hospital of Chengdu University of TCM, China
| | - Mujiexin Liu
- Eye School, Chengdu University of Traditional Chinese Medicine, Ineye Hospital of Chengdu University of TCM, China
| | - Qinlai Huang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Zijie Sun
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Lin Ning
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, 611844, China
| | - Junguo Duan
- Eye School, Chengdu University of Traditional Chinese Medicine, Ineye Hospital of Chengdu University of TCM, China
| | - Siquan Zhu
- Eye School, Chengdu University of Traditional Chinese Medicine, Ineye Hospital of Chengdu University of TCM, China
| | - Jian Huang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China.
| | - Hui Yang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China; School of Computer Science, Chengdu University of Information Technology, Chengdu, 611844, China.
| |
|
23
|
Jia J, Wu G, Li M, Qiu W. pSuc-EDBAM: Predicting lysine succinylation sites in proteins based on ensemble dense blocks and an attention module. BMC Bioinformatics 2022; 23:450. [PMID: 36316638 PMCID: PMC9620660 DOI: 10.1186/s12859-022-05001-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/16/2022] [Accepted: 10/25/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Lysine succinylation is a newly discovered protein post-translational modification. Predicting succinylation sites helps investigate treatments for metabolic diseases. However, because biological experimental approaches are costly and inefficient, it is necessary to develop efficient computational approaches. RESULTS: In this paper, we proposed a novel predictor based on ensemble dense blocks and an attention module, called pSuc-EDBAM, which adopts one-hot encoding to derive the feature maps of protein sequences and generates the low-level feature maps through a 1-D CNN. Afterward, the ensemble dense blocks were used to capture feature information at different levels in the process of feature learning. We also introduced an attention module to evaluate the importance of different features. The experimental results show that Acc reaches 74.25% and MCC reaches 0.2927 on the testing dataset, which suggests that pSuc-EDBAM outperforms the existing predictors. CONCLUSIONS: The experimental results of ten-fold cross-validation on the training dataset and the independent test on the testing dataset showed that pSuc-EDBAM outperforms the existing succinylation site predictors and can predict potential succinylation sites effectively. pSuc-EDBAM is feasible and obtains credible predictive results, which may also provide valuable references for other related research. For the convenience of experimental scientists, a user-friendly web server has been established ( http://bioinfo.wugenqiang.top/pSuc-EDBAM/ ), by which the desired results can be easily obtained.
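The one-hot encoding step mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the 20-letter alphabet order, the use of `X` as a padding symbol, the example window, and the function name are all hypothetical and may differ from the encoding used in pSuc-EDBAM.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot_encode(peptide, alphabet=AMINO_ACIDS):
    """Return a len(peptide) x len(alphabet) matrix of 0/1 values.

    Characters outside the alphabet (for example a padding symbol)
    are encoded as all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    matrix = []
    for residue in peptide:
        row = [0] * len(alphabet)
        if residue in index:
            row[index[residue]] = 1
        matrix.append(row)
    return matrix


# Hypothetical 5-residue window centered on a lysine (K); X marks padding.
window = "AGKXL"
encoded = one_hot_encode(window)
```

The resulting matrix has one row per sequence position, which is the shape a 1-D convolution over the sequence axis would consume.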
Affiliation(s)
- Jianhua Jia
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 China
| | - Genqiang Wu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 China
| | - Meifang Li
- Computer Department, Nanchang Institute of Technology, Nanchang, 330044 China
| | - Wangren Qiu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 China
| |
|
24
|
Zhao D, Wang L, Chen Z, Zhang L, Xu L. KRAS is a prognostic biomarker associated with diagnosis and treatment in multiple cancers. Front Genet 2022; 13:1024920. [PMID: 36330448 PMCID: PMC9624065 DOI: 10.3389/fgene.2022.1024920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Accepted: 09/20/2022] [Indexed: 11/21/2022] Open
Abstract
KRAS encodes K-Ras proteins, which take part in the MAPK pathway. The expression level of KRAS is high in tumor patients. Our study compared KRAS expression levels across 33 kinds of tumor tissues. Additionally, we studied the association of KRAS expression levels with diagnostic and prognostic values, clinicopathological features, and tumor immunity. We established 22 immune-infiltrating cell expression datasets to calculate immune and stromal scores to evaluate the tumor microenvironment. KRAS, immune checkpoint genes and interacting genes were selected to construct the PPI network. We selected 79 immune checkpoint genes and interacting related genes to calculate the correlation. Based on the 33 tumor expression datasets, we conducted GSEA (gene set enrichment analysis) to show the KRAS and other co-expressed genes associated with cancers. KRAS may be a reliable prognostic biomarker in the diagnosis of cancer patients and has the potential to be included in cancer-targeted drugs.
Affiliation(s)
- Da Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- School of Food and Drug, Shenzhen Polytechnic, Shenzhen, China
| | - Lizhuang Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Zheng Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- School of Food and Drug, Shenzhen Polytechnic, Shenzhen, China
| | - Lijun Zhang
- School of Food and Drug, Shenzhen Polytechnic, Shenzhen, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- *Correspondence: Lei Xu,
|
25
|
Chen D, Li S, Chen Y. ISTRF: Identification of sucrose transporter using random forest. Front Genet 2022; 13:1012828. [PMID: 36171889 PMCID: PMC9511101 DOI: 10.3389/fgene.2022.1012828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Accepted: 08/22/2022] [Indexed: 12/05/2022]
Abstract
Sucrose transporter (SUT) is a type of transmembrane protein that exists widely in plants and plays a significant role in sucrose transport and sucrose-specific signal sensing. Therefore, identifying sucrose transporters is important for the study of seed development and plant flowering and growth. In this study, a random forest-based model named ISTRF is proposed to identify sucrose transporters. First, a dataset containing 382 SUT proteins and 911 non-SUT proteins was constructed from the UniProt and PFAM databases. Second, k-separated-bigrams-PSSM was used to represent protein sequences. Third, the Borderline-SMOTE algorithm was used to counter the effect of class imbalance in the training data on identification performance. Finally, the random forest algorithm was used to train the identification model. The 10-fold cross-validation results showed that k-separated-bigrams-PSSM was the most discriminative feature for identifying sucrose transporters and that the Borderline-SMOTE algorithm improved the performance of the identification model. Furthermore, random forest was superior to other classifiers on almost all indicators. Compared with other identification models, ISTRF has the best overall performance and substantially improves the identification of sucrose transporter proteins.
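As a rough illustration of the k-separated-bigrams-PSSM representation named above, the sketch below uses the definition commonly given for this feature: for a PSSM `P` with one row per residue position, entry `(m, n)` of the feature matrix is the sum over positions `i` of `P[i][m] * P[i+k][n]`. The function name and the toy matrix are assumptions for illustration, and the abstract does not say which values of `k` ISTRF uses.

```python
def k_separated_bigrams(pssm, k=1):
    """Flattened k-separated bigram features from a PSSM given as a list of rows.

    Each row holds the scores of one sequence position (20 columns for a real PSSM).
    Returns width * width values, ordered row by row.
    """
    length = len(pssm)
    width = len(pssm[0])
    features = [[0.0] * width for _ in range(width)]
    for i in range(length - k):  # every pair of positions that are k apart
        for m in range(width):
            for n in range(width):
                features[m][n] += pssm[i][m] * pssm[i + k][n]
    return [value for row in features for value in row]


# Toy 3-column "PSSM" so the result is easy to check by hand.
toy_pssm = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
bigrams_k1 = k_separated_bigrams(toy_pssm, k=1)
bigrams_k2 = k_separated_bigrams(toy_pssm, k=2)
```

With a real 20-column PSSM, each value of `k` contributes 400 features, and vectors for several `k` can be concatenated.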
Affiliation(s)
- Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Sai Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yu Chen
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
26
|
Duan T, Kuang Z, Deng L. SVMMDR: Prediction of miRNAs-drug resistance using support vector machines based on heterogeneous network. Front Oncol 2022; 12:987609. [PMID: 36338674 PMCID: PMC9632662 DOI: 10.3389/fonc.2022.987609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 09/14/2022] [Indexed: 11/21/2022]
Abstract
In recent years, miRNAs have been regarded as potential high-value therapeutic targets because of their complex and delicate mechanisms of gene regulation. Abnormal miRNA expression can cause drug resistance, reducing the therapeutic effect of treatment. Revealing the associations between miRNAs and drug resistance can help in the design of effective drugs or drug combinations. However, conventional experiments for identifying miRNA-drug resistance associations are time-consuming and costly. Therefore, it is of practical value to develop an accurate and efficient computational method for predicting miRNA-drug resistance associations. In this paper, a method based on Support Vector Machines (SVM) to predict the association between MiRNA and Drug Resistance (SVMMDR) is proposed. SVMMDR integrates miRNA-drug resistance associations, miRNA sequence similarity, drug chemical structure similarity and other similarities, extracts path-based HeteSim features, and obtains the inclined diffusion feature through random walk with restart. By combining these features, the prediction score between miRNAs and drug resistance is obtained with the SVM. The innovation of SVMMDR is that the inclined diffusion feature is obtained by inclined random walk with restart, the node information and path information in the heterogeneous network are integrated, and the SVM is used to predict potential miRNA-drug resistance associations. The average AUC of SVMMDR is 0.978 in 10-fold cross-validation.
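The diffusion feature described above builds on random walk with restart. The sketch below shows only the standard algorithm on a small undirected network; the "inclined" variant is specific to SVMMDR and is not specified in the abstract, so it is not reproduced here. The function name, the restart probability of 0.5, and the toy graph are assumptions for illustration.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `seed`.

    `adjacency` is a square list-of-lists of non-negative edge weights.
    """
    n = len(adjacency)
    # Column-normalise so that each column of the transition matrix sums to 1.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [0.0] * n
    start[seed] = 1.0
    current = start[:]
    for _ in range(max_iter):
        updated = [
            (1 - restart) * sum(transition[i][j] * current[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(updated, current)) < tol:
            return updated
        current = updated
    return current


# Toy path graph 0 - 1 - 2, with the walk seeded at node 0.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = random_walk_with_restart(path_graph, seed=0)
```

Nodes closer to the seed receive higher scores, which is why the steady-state vector can serve as a network-proximity feature for a downstream classifier.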
|
27
|
Huang D, Chen K, Song B, Wei Z, Su J, Coenen F, de Magalhães JP, Rigden DJ, Meng J. Geographic encoding of transcripts enabled high-accuracy and isoform-aware deep learning of RNA methylation. Nucleic Acids Res 2022; 50:10290-10310. [PMID: 36155798 PMCID: PMC9561283 DOI: 10.1093/nar/gkac830] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 03/21/2022] [Revised: 08/26/2022] [Accepted: 09/15/2022] [Indexed: 12/25/2022]
Abstract
As the most pervasive epigenetic mark present on mRNA and lncRNA, N6-methyladenosine (m6A) RNA methylation regulates all stages of RNA life in various biological processes and disease mechanisms. Computational methods for deciphering RNA modification have achieved great success in recent years; nevertheless, their potential remains underexploited. One reason for this is that existing models usually consider only the sequence of transcripts, ignoring the various regions (or geography) of transcripts such as 3′UTR and intron, where the epigenetic mark forms and functions. Here, we developed three simple yet powerful encoding schemes for transcripts to capture the submolecular geographic information of RNA, which is largely independent from sequences. We show that m6A prediction models based on geographic information alone can achieve comparable performances to classic sequence-based methods. Importantly, geographic information substantially enhances the accuracy of sequence-based models, enables isoform- and tissue-specific prediction of m6A sites, and improves m6A signal detection from direct RNA sequencing data. The geographic encoding schemes we developed have exhibited strong interpretability, and are applicable to not only m6A but also N1-methyladenosine (m1A), and can serve as a general and effective complement to the widely used sequence encoding schemes in deep learning applications concerning RNA transcripts.
Affiliation(s)
- Daiyun Huang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Department of Computer Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Kunqi Chen
- Key Laboratory of Gastrointestinal Cancer (Fujian Medical University), Ministry of Education, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, PR China
| | - Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Zhen Wei
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Institute of Life Course and Medical Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jionglong Su
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; School of AI and Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China
| | - Frans Coenen
- Department of Computer Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - João Pedro de Magalhães
- Institute of Life Course and Medical Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK; AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China
|
28
|
A Statistical Analysis of the Sequence and Structure of Thermophilic and Non-Thermophilic Proteins. Int J Mol Sci 2022; 23:10116. [PMID: 36077513 PMCID: PMC9456548 DOI: 10.3390/ijms231710116] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 07/29/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 11/17/2022]
Abstract
Thermophilic proteins have various practical applications in theoretical research and in industry. In recent years, the demand for thermophilic proteins on an industrial scale has been increasing; therefore, the engineering of thermophilic proteins has become an active area of protein engineering. However, the exact mechanism of protein thermostability is not yet known, and engineering thermophilic proteins requires knowing the basis of thermostability. To understand this basis, we performed a statistical analysis of the sequences, secondary structures, hydrogen bonds, salt bridges, DHA (Donor-Hydrogen-Acceptor) angles, and bond lengths of ten pairs of thermophilic proteins and their non-thermophilic orthologs. Our findings suggest that polar amino acids contribute to thermostability in proteins by forming hydrogen bonds and salt bridges, which provide resistance against protein denaturation. Shorter bond lengths and wider DHA angles provide greater bond stability in thermophilic proteins. Moreover, the increased frequency of aromatic amino acids in thermophilic proteins contributes to thermal stability by forming more aromatic interactions. Additionally, the coil, helix, and loop in the secondary structure also contribute to thermostability.
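The two hydrogen-bond geometry measures discussed above, the DHA angle and the bond length, can be computed directly from atomic coordinates. The sketch below is a generic geometric calculation, not the authors' pipeline; the function names and the toy coordinates (in ångströms) are assumptions for illustration.

```python
import math


def dha_angle(donor, hydrogen, acceptor):
    """Donor-Hydrogen-Acceptor angle in degrees, measured at the hydrogen atom."""
    to_donor = [d - h for d, h in zip(donor, hydrogen)]
    to_acceptor = [a - h for a, h in zip(acceptor, hydrogen)]
    dot = sum(x * y for x, y in zip(to_donor, to_acceptor))
    norms = math.hypot(*to_donor) * math.hypot(*to_acceptor)
    cos_theta = max(-1.0, min(1.0, dot / norms))  # guard against rounding drift
    return math.degrees(math.acos(cos_theta))


def bond_length(atom_a, atom_b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(atom_a, atom_b)


# Toy coordinates: a perfectly linear hydrogen bond and a right-angled one.
linear = dha_angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0))
bent = dha_angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0))
h_to_acceptor = bond_length((1.0, 0.0, 0.0), (3.0, 0.0, 0.0))
```

A wider angle means the three atoms are closer to collinear, which is the geometry the abstract associates with greater bond stability.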
|
29
|
Yuan SS, Gao D, Xie XQ, Ma CY, Su W, Zhang ZY, Zheng Y, Ding H. IBPred: A sequence-based predictor for identifying ion binding protein in phage. Comput Struct Biotechnol J 2022; 20:4942-4951. [PMID: 36147670 PMCID: PMC9474292 DOI: 10.1016/j.csbj.2022.08.053] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/03/2022] [Revised: 08/23/2022] [Accepted: 08/24/2022] [Indexed: 11/16/2022]
Abstract
Ion binding proteins (IBPs) can selectively and non-covalently interact with ions. IBPs in phages also play an important role in biological processes. Therefore, accurate identification of IBPs is necessary for understanding their biological functions and the molecular mechanisms that involve binding to ions. Since experimental methods in molecular biology are still labor-intensive and costly for identifying IBPs, it is helpful to develop computational methods that identify IBPs quickly and efficiently. In this work, a random forest (RF)-based model was constructed to quickly identify IBPs. Based on protein sequence information and the physicochemical properties of residues, the dipeptide composition combined with the physicochemical correlation between two residues was proposed for feature extraction. A feature selection technique, analysis of variance (ANOVA), was used to exclude redundant information. By comparison with other classification methods, we demonstrated that our method could identify IBPs accurately. Based on the model, a Python package named IBPred was built, with source code available at https://github.com/ShishiYuan/IBPred.
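The two generic building blocks named above, dipeptide composition and ANOVA-based feature ranking, can be sketched as follows. This is not the IBPred package's code or API; the function names are hypothetical, the physicochemical-correlation part of the feature set is omitted, and the F-score shown is the textbook one-way ANOVA statistic for a single feature.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(sequence):
    """Frequency of each of the 400 dipeptides among adjacent residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature; `groups` holds its values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


composition = dipeptide_composition("AAAC")
f_value = anova_f_score([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

Ranking features by F-score and keeping the top-scoring ones is one common way to use ANOVA for feature selection; a higher score means the feature separates the classes more clearly.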
Affiliation(s)
- Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dong Gao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xue-Qin Xie
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Cai-Yi Ma
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Yan Zheng
- Baotou Medical College, Baotou 014040, China
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
30
|
Role of Posttranslational Modifications of Proteins in Cardiovascular Disease. Oxid Med Cell Longev 2022; 2022:3137329. [PMID: 35855865 PMCID: PMC9288287 DOI: 10.1155/2022/3137329] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/22/2022] [Accepted: 06/23/2022] [Indexed: 01/03/2023]
Abstract
Cardiovascular disease (CVD) has become a leading cause of mortality and morbidity globally, making it an urgent concern. Although studies have been performed on CVD, the molecular mechanisms of all types of CVD remain largely unknown. However, recent in vivo and in vitro studies have successfully identified important roles of posttranslational modifications (PTMs) in various diseases, including CVD. PTMs are chemical modifications of specific amino acid residues after protein biosynthesis, a key process that can influence the activity or expression level of proteins. Studies on PTMs have contributed directly to improving the therapeutic strategies for CVD. In this review, we examined recent progress on PTMs and highlighted their importance in both physiological and pathological conditions of the cardiovascular system. Overall, the findings of this review contribute to the understanding of PTMs and their potential roles in the treatment of CVD.
|
31
|
Liu L, Song B, Chen K, Zhang Y, de Magalhães JP, Rigden DJ, Lei X, Wei Z. WHISTLE server: A high-accuracy genomic coordinate-based machine learning platform for RNA modification prediction. Methods 2022; 203:378-382. [PMID: 34245870 DOI: 10.1016/j.ymeth.2021.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/08/2021] [Revised: 06/28/2021] [Accepted: 07/05/2021] [Indexed: 01/12/2023]
Abstract
The primary sequences of DNA, RNA and protein have been the dominant information source for existing machine learning tools, especially in contexts not fully explored by wet-lab experimental approaches. Since molecular markers are profoundly orchestrated in living organisms, markers that cannot be unambiguously recovered from the primary sequence often help to predict other biological events. To the best of our knowledge, there is currently no tool for building and deploying machine learning models that consider genomic evidence. We therefore developed the WHISTLE server, the first machine learning platform based on genomic coordinates. It features convenient covariate extraction and web deployment of models, with 46 distinct genomic features integrated alongside the conventional sequence features. We showed that, when predicting m6A sites from the SRAMP project, the model integrating genomic features substantially outperformed those based only on sequence features. The WHISTLE server should be a useful tool for studying biological attributes specifically associated with genomic coordinates, and is freely accessible at: www.xjtlu.edu.cn/biologicalsciences/whi2.
Affiliation(s)
- Lian Liu
- School of Computer Sciences, Shaanxi Normal University, Xi'an, Shaanxi 710119, China
| | - Bowen Song
- Department of Mathematical Sciences, University of Liverpool, L69 7ZB Liverpool, United Kingdom; Institute of Ageing & Chronic Disease, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Kunqi Chen
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China
| | - Yuxin Zhang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China
| | - João Pedro de Magalhães
- Institute of Ageing & Chronic Disease, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Xiujuan Lei
- School of Computer Sciences, Shaanxi Normal University, Xi'an, Shaanxi 710119, China.
| | - Zhen Wei
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom.
|
32
|
Xia Y, Jiang M, Luo Y, Feng G, Jia G, Zhang H, Wang P, Ge R. SuccSPred2.0: A Two-Step Model to Predict Succinylation Sites Based on Multifeature Fusion and Selection Algorithm. J Comput Biol 2022; 29:1085-1094. [PMID: 35714347 DOI: 10.1089/cmb.2022.0109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Protein succinylation is a type of post-translational modification discovered within the past decade. Experiments have verified that it plays an important role in biological structure and function. However, identifying succinylation sites through wet-lab experiments is time-consuming and laborious, and traditional techniques cannot keep pace with the rapid growth of biological sequence data sets. In this study, a new computational method named SuccSPred2.0 is proposed to identify succinylation sites in protein sequences based on multifeature fusion and the maximal information coefficient (MIC) method. SuccSPred2.0 is implemented as a two-step strategy. First, high-dimensional features are reduced by linear discriminant analysis to prevent overfitting. Subsequently, the MIC method is employed to select the important features, which are combined with classifiers to predict succinylation sites. In 10-fold cross-validation and on independent test data sets, SuccSPred2.0 obtained promising improvements. Comparative experiments showed that SuccSPred2.0 was superior to previous tools in identifying succinylation sites in the given proteins.
Affiliation(s)
- Yixiao Xia
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Minchao Jiang
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Yizhang Luo
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Guanwen Feng
- Xi'an Key Laboratory of Big Data and Intelligent Vision, School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Gangyong Jia
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Hua Zhang
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Pu Wang
- Computer School, Hubei University of Arts and Science, Xiangyang, China
| | - Ruiquan Ge
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
|
33
|
Wang N, Yan K, Zhang J, Liu B. iDRNA-ITF: identifying DNA- and RNA-binding residues in proteins based on induction and transfer framework. Brief Bioinform 2022; 23:6609520. [PMID: 35709747 DOI: 10.1093/bib/bbac236] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/15/2022] [Revised: 05/06/2022] [Accepted: 05/20/2022] [Indexed: 11/14/2022]
Abstract
Protein-DNA and protein-RNA interactions are involved in many biological activities. In the post-genome era, accurate identification of DNA- and RNA-binding residues in protein sequences is of great significance for studying protein functions and promoting new drug design and development. Therefore, some sequence-based computational methods have been proposed for identifying DNA- and RNA-binding residues. However, they failed to fully utilize the functional properties of residues, leading to limited prediction performance. In this paper, a sequence-based method iDRNA-ITF was proposed to incorporate the functional properties in residue representation by using an induction and transfer framework. The properties of nucleic acid-binding residues were induced by the nucleic acid-binding residue feature extraction network, and then transferred into the feature integration modules of the DNA-binding residue prediction network and the RNA-binding residue prediction network for the final prediction. Experimental results on four test sets demonstrate that iDRNA-ITF achieves the state-of-the-art performance, outperforming the other existing sequence-based methods. The webserver of iDRNA-ITF is freely available at http://bliulab.net/iDRNA-ITF.
Affiliation(s)
- Ning Wang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
|
34
|
Jia J, Wu G, Qiu W. pSuc-FFSEA: Predicting Lysine Succinylation Sites in Proteins Based on Feature Fusion and Stacking Ensemble Algorithm. Front Cell Dev Biol 2022; 10:894874. [PMID: 35686053 PMCID: PMC9170990 DOI: 10.3389/fcell.2022.894874] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/12/2022] [Accepted: 04/25/2022] [Indexed: 11/13/2022]
Abstract
As a widespread type of protein post-translational modification discovered in recent years, succinylation plays a key role in regulating protein conformation and cellular function. Numerous studies have shown that succinylation is closely associated with the development of many diseases. To gain insight into the mechanism of succinylation, it is vital to identify lysine succinylation sites. However, experimental identification of succinylation sites is time-consuming and laborious, and traditional identification tools cannot keep pace with the rapid growth of datasets. To solve this problem, we developed a new predictor named pSuc-FFSEA, which predicts succinylation sites in protein sequences by feature fusion and a stacking ensemble algorithm. Specifically, the sequence information and physicochemical properties were first extracted using EBGW, One-Hot, continuous bag-of-words, chaos game representation, and AAF_DWT. Feature selection was then performed, applying LASSO to select the optimal subset of features for the classifier. Next, a two-layer stacking ensemble classifier was designed, with three classifiers (SVM, broad learning system, and LightGBM) as the base classifiers of the first layer and a logistic regression classifier as the meta classifier of the second layer. To further improve the prediction accuracy of the model and reduce the computational effort, the Bayesian optimization algorithm and the grid search algorithm were used to optimize the hyperparameters of the classifier. Finally, the results of rigorous 10-fold cross-validation indicated that our predictor showed excellent robustness and performed better than previous prediction tools, achieving an average prediction accuracy of 0.7773 ± 0.0120. In addition, for the convenience of experimental scientists, a user-friendly and comprehensive web server for pSuc-FFSEA has been established at https://bio.cangmang.xyz/pSuc-FFSEA, through which one can easily obtain the expected data and results without going through the complicated mathematics.
Affiliation(s)
- Jianhua Jia
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, China
| | - Genqiang Wu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, China
| | - Wangren Qiu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, China
|
35
|
Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein–DNA-binding sites in computational biology. Brief Funct Genomics 2022; 21:357-375. [DOI: 10.1093/bfgp/elac009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/02/2022] [Revised: 04/07/2022] [Accepted: 04/22/2022] [Indexed: 01/08/2023]
Abstract
Transcription factors are important cellular components of gene expression control. Transcription factor binding sites are locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As common proteins, transcription factors play a meaningful role in life-related activities. With the growing number of protein sequences, effectively predicting the structure and function of proteins has become an urgent problem. At present, protein–DNA-binding site prediction methods are based on traditional machine learning algorithms and deep learning algorithms. In the early stage, methods based on traditional machine learning algorithms were usually used to predict protein–DNA-binding sites. In recent years, methods based on deep learning to predict protein–DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods used to predict the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein–DNA-binding sites can be roughly divided into three categories: convolutional neural network (CNN), recurrent neural network (RNN) and hybrid neural network based on CNN–RNN. The purpose of this review is to provide an overview of the computational and experimental methods applied in the field of protein–DNA-binding site prediction today. This paper introduces the traditional machine learning and deep learning methods for protein–DNA-binding site prediction in terms of the data processing characteristics of existing learning frameworks and the differences between basic learning model frameworks. Existing methods in this field are relatively simple compared with those in natural language processing, computer vision, computer graphics and other fields. Therefore, this summary of existing protein–DNA-binding site prediction methods will help researchers better understand this field.
|
36
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system, which protects the cell against invading foreign DNA. Experimental methods have proved time-consuming and costly, and some computational methods have emerged. Support vector machine theory has received extensive attention in the bioinformatics field because of its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important feature extraction step. This study omits that step and replaces it with an easy-to-obtain sequence distance matrix to obtain better results.
Method:
First, sequence alignment was used to obtain the similarity matrix. Then, a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix was made positive semi-definite so that it could be used as the kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
The five-fold cross-validation of this model on rice and mouse data achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has obvious advantages compared with traditional machine learning methods. Meanwhile, this model achieved accuracies of 0.943, 0.982 and 0.818, Matthews correlation coefficients of 0.944, 0.982 and 0.838, and F1 scores of 0.942, 0.982 and 0.840 for the rice, M. musculus and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms iIM-CNN and csDMA, the latest methods for predicting DNA 6mA modification.
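The three measures reported in the results above (accuracy, Matthews correlation coefficient and F1 score) are all functions of the binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 and Matthews correlation coefficient from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts for illustration only.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
perfect = binary_metrics(tp=5, fp=0, tn=5, fn=0)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is often reported alongside accuracy when the positive and negative classes are imbalanced.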
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
|
37
|
Xia W, Zheng L, Fang J, Li F, Zhou Y, Zeng Z, Zhang B, Li Z, Li H, Zhu F. PFmulDL: a novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods. Comput Biol Med 2022; 145:105465. [PMID: 35366467 DOI: 10.1016/j.compbiomed.2022.105465] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 03/04/2022] [Revised: 03/22/2022] [Accepted: 03/25/2022] [Indexed: 02/06/2023]
Abstract
Bioinformatic annotation of protein function is essential but extremely complex, and it requires extensive effort to develop effective prediction methods. However, existing methods tend to amplify the representativeness of families with large numbers of proteins by misclassifying proteins in families with small numbers of proteins. That is to say, the ability of existing methods to annotate proteins in the 'rare classes' remains limited. Herein, a new protein function annotation strategy, PFmulDL, which integrates multiple deep learning methods, was constructed. First, the recurrent neural network was integrated, for the first time, with the convolutional neural network to facilitate function annotation. Second, a transfer learning method was introduced into model construction to further improve prediction performance. Third, based on the latest Gene Ontology data, the newly constructed model could annotate the largest number of protein families compared with existing methods. Finally, the newly constructed model was found capable of significantly elevating the prediction performance for the 'rare classes' without sacrificing that for the 'major classes'. All in all, given the emerging need to improve prediction performance for proteins in 'rare classes', this new strategy should become an essential complement to the existing methods for protein function prediction. All the models and source code are freely available and open to all users at: https://github.com/idrblab/PFmulDL.
Affiliation(s)
- Weiqi Xia
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Lingyan Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Jiebin Fang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Fengcheng Li
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Ying Zhou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Zhenyu Zeng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Honglin Li
- School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China.
| |
|
38
|
Jiao S, Chen Z, Zhang L, Zhou X, Shi L. ATGPred-FL: sequence-based prediction of autophagy proteins with feature representation learning. Amino Acids 2022; 54:799-809. [PMID: 35286461 DOI: 10.1007/s00726-022-03145-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022]
Abstract
Autophagy plays an important role in biological evolution and is regulated by many autophagy proteins. Accurate identification of autophagy proteins is crucially important to reveal their biological functions. Due to the expense and labor cost of experimental methods, it is urgent to develop automated, accurate and reliable sequence-based computational tools to enable the identification of novel autophagy proteins among numerous proteins and peptides. For this purpose, a new predictor named ATGPred-FL was proposed for the efficient identification of autophagy proteins. We investigated various sequence-based feature descriptors and adopted the feature learning method to generate corresponding, more informative probability features. Then, a two-step feature selection strategy based on accuracy was utilized to remove irrelevant and redundant features, leading to the most discriminative 14-dimensional feature set. The final predictor was built using a support vector machine classifier, which performed favorably on both the training and testing sets with accuracy values of 94.40% and 90.50%, respectively. ATGPred-FL is the first ATG machine learning predictor based on protein primary sequences. We envision that ATGPred-FL will be an effective and useful tool for autophagy protein identification. It is freely available at http://lab.malab.cn/~acy/ATGPred-FL, and the source code and datasets are accessible at https://github.com/jiaoshihu/ATGPred.
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, 7098 Liuxian Street, Shenzhen, 518055, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu, 61005, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, 518172, China
| | - Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, 150001, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, No 415, Fengyang Road, Huangpu District, Shanghai, 210000, China.
| |
|
39
|
Han YM, Yang H, Huang QL, Sun ZJ, Li ML, Zhang JB, Deng KJ, Chen S, Lin H. Risk prediction of diabetes and pre-diabetes based on physical examination data. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3597-3608. [PMID: 35341266 DOI: 10.3934/mbe.2022166] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Indexed: 06/14/2023]
Abstract
Diabetes is a metabolic disorder caused by insufficient insulin secretion and insulin secretion disorders. From health to diabetes, there are generally three stages: health, pre-diabetes and type 2 diabetes. Early diagnosis of diabetes is the most effective way to prevent and control diabetes and its complications. In this work, we collected the physical examination data from Beijing Physical Examination Center from January 2006 to December 2017, and divided the population into three groups according to the WHO (1999) Diabetes Diagnostic Standards: normal fasting plasma glucose (NFG) (FPG < 6.1 mmol/L), mildly impaired fasting plasma glucose (IFG) (6.1 mmol/L ≤ FPG < 7.0 mmol/L) and type 2 diabetes (T2DM) (FPG > 7.0 mmol/L). Finally, we obtained 1,221,598 NFG samples, 285,965 IFG samples and 387,076 T2DM samples, with a total of 15 physical examination indexes. Furthermore, taking eXtreme Gradient Boosting (XGBoost), random forest (RF), logistic regression (LR), and fully connected neural network (FCN) as classifiers, four models were constructed to distinguish NFG, IFG and T2DM. The comparison results show that XGBoost has the best performance, with AUC (macro) of 0.7874 and AUC (micro) of 0.8633. In addition, based on the XGBoost classifier, three binary classification models were also established to discriminate NFG from IFG, NFG from T2DM, and IFG from T2DM. On the independent dataset, the AUCs were 0.7808, 0.8687, and 0.7067, respectively. Finally, we analyzed the importance of the features and identified the risk factors associated with diabetes.
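The three-way grouping by fasting plasma glucose described in this abstract can be sketched as a small function. This is an illustrative sketch only, not the authors' code: the function name `fpg_group` and the sample readings are hypothetical. The abstract does not say which group a reading of exactly 7.0 mmol/L falls into, so the sketch assumes it counts as T2DM.

```python
def fpg_group(fpg_mmol_per_l: float) -> str:
    """Assign a fasting plasma glucose reading (mmol/L) to NFG, IFG or T2DM."""
    if fpg_mmol_per_l < 6.1:
        return "NFG"   # normal fasting plasma glucose
    if fpg_mmol_per_l < 7.0:
        return "IFG"   # mildly impaired fasting plasma glucose
    # Assumption: a reading of exactly 7.0 mmol/L is treated as T2DM,
    # because the abstract leaves that boundary value unassigned.
    return "T2DM"


# Hypothetical readings, chosen to exercise both thresholds.
readings = [5.2, 6.1, 6.9, 7.0, 8.4]
groups = [fpg_group(r) for r in readings]
print(groups)
```

Labels produced this way would serve as the three-class target for classifiers such as those named in the abstract.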
Affiliation(s)
- Yu-Mei Han
- Beijing Physical Examination Center, Beijing, China
| | - Hui Yang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Ke-Jun Deng
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shuo Chen
- Beijing Physical Examination Center, Beijing, China
| | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
40
|
Cui F, Zhang Z, Cao C, Zou Q, Chen D, Su X. Protein-DNA/RNA interactions: Machine intelligence tools and approaches in the era of artificial intelligence and big data. Proteomics 2022; 22:e2100197. [PMID: 35112474 DOI: 10.1002/pmic.202100197] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 11/30/2021] [Revised: 01/02/2022] [Accepted: 01/17/2022] [Indexed: 11/09/2022]
Abstract
With the development of artificial intelligence technologies and the availability of large amounts of biological data, computational methods for proteomics have undergone a developmental process from traditional machine learning to deep learning. This review focuses on computational approaches and tools for the prediction of protein-DNA/RNA interactions using machine intelligence techniques. We provide an overview of the development progress of computational methods and summarize the advantages and shortcomings of these methods. We further compiled applications in tasks related to the protein-DNA/RNA interactions, and pointed out possible future application trends. Moreover, biological sequence-digitizing representation strategies used in different types of computational methods are also summarized and discussed.
Affiliation(s)
- Feifei Cui
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.,Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Zilong Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.,Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Chen Cao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.,Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, 324000, China
| | - Xi Su
- Foshan Maternal and Child Health Hospital, Foshan, Guangdong, China
| |
|
41
|
Zhang S, Jiang H, Gao B, Yang W, Wang G. Identification of Diagnostic Markers for Breast Cancer Based on Differential Gene Expression and Pathway Network. Front Cell Dev Biol 2022; 9:811585. [PMID: 35096840 PMCID: PMC8790293 DOI: 10.3389/fcell.2021.811585] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/09/2021] [Accepted: 12/13/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Breast cancer is the second most common cancer in the world; its incidence continues to rise worldwide, and it seriously threatens women's health. Therefore, it is very important to explore the characteristic changes of breast cancer at the gene level, including the screening of differentially expressed genes and the identification of diagnostic markers. Methods: The gene expression profiles of breast cancer were obtained from the TCGA database. The edgeR R software package was used to screen the differentially expressed genes between breast cancer patients and normal samples. Function and pathway enrichment analysis of these genes revealed significantly enriched functions and pathways. Next, these pathways were downloaded from the KEGG website, the gene interaction relations were extracted, and the KEGG pathway gene interaction network was constructed. The potential diagnostic markers of breast cancer were obtained by combining the differentially expressed genes with the key genes in the network. Finally, these markers were used to construct the diagnostic prediction model of breast cancer, and the predictive ability of the model and the diagnostic ability of the markers were verified by internal and external data. Results: 1060 differentially expressed genes were identified between breast cancer patients and normal controls. Enrichment analysis revealed 28 significantly enriched pathways (p < 0.05). These pathways were downloaded from the KEGG website, and the gene interaction relations were extracted to construct the KEGG pathway gene interaction network, which contained 1277 nodes and 7345 edges. The key nodes with a degree greater than 30 were extracted from the network, yielding 154 genes. These 154 key genes shared 23 genes with the differentially expressed genes, and these 23 genes serve as potential diagnostic markers for breast cancer. The 23 genes were used as features to construct the SVM classification model, and the model had good predictive ability in both the training dataset and the validation dataset (AUC = 0.960 and 0.907, respectively). Conclusion: This study showed that differences in gene expression level are important for the diagnosis of breast cancer, and identified 23 breast cancer diagnostic markers, which provide valuable information for clinical diagnosis and basic treatment experiments.
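The marker-selection step in this abstract (keep network nodes whose degree exceeds a threshold, then intersect them with the differentially expressed genes) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' pipeline: the gene names, edge list and threshold are hypothetical, and edges are assumed to be undirected and listed once each.

```python
from collections import defaultdict


def key_genes(edges, min_degree_exclusive):
    """Return the nodes whose degree is strictly greater than the threshold."""
    degree = defaultdict(int)
    for a, b in edges:
        # Each undirected edge adds one to the degree of both endpoints.
        degree[a] += 1
        degree[b] += 1
    return {gene for gene, d in degree.items() if d > min_degree_exclusive}


# Hypothetical toy network; the abstract's network has 1277 nodes and 7345 edges.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
# Hypothetical differentially expressed genes.
differential = {"A", "E", "F"}

hubs = key_genes(edges, min_degree_exclusive=2)  # the abstract uses degree > 30
markers = hubs & differential                    # candidate diagnostic markers
print(sorted(hubs), sorted(markers))
```

In the abstract, the same two operations reduce 154 key genes and 1060 differentially expressed genes to 23 shared genes.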
Affiliation(s)
- Shumei Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Haoran Jiang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, Shenzhen, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| |
|
42
|
Ma D, Chen Z, He Z, Huang X. A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem. Front Genet 2022; 12:818841. [PMID: 35154261 PMCID: PMC8832978 DOI: 10.3389/fgene.2021.818841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2021] [Accepted: 12/14/2021] [Indexed: 11/13/2022] Open
Abstract
Machine learning has been widely used to solve complex problems in engineering applications and scientific fields, and many machine learning-based methods have achieved good results in different fields. SNAREs are key elements of membrane fusion and required for the fusion process of stable intermediates. They are also associated with the formation of some psychiatric disorders. This study processes the original sequence data with the synthetic minority oversampling technique (SMOTE) to solve the problem of data imbalance and produces the most suitable machine learning model with the iLearnPlus platform for the identification of SNARE proteins. Ultimately, a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the cross-validation dataset, and a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the independent dataset (the adaptive skip dipeptide composition descriptor was used for feature extraction, and LightGBM with proper parameters was used as the classifier). These results demonstrate that this combination can perform well in the classification of SNARE proteins and is superior to other methods.
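The four reported measures (sensitivity, specificity, accuracy and MCC) all derive from a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of positives recovered
    specificity = tn / (tn + fp)            # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (12 positives, 88 negatives).
m = binary_metrics(tp=8, fn=4, tn=80, fp=8)
print({name: round(value, 4) for name, value in m.items()})
```

On imbalanced data such as this, accuracy can stay high while sensitivity is modest, which is why MCC is usually reported alongside it.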
|
43
|
Zulfiqar H, Huang QL, Lv H, Sun ZJ, Dao FY, Lin H. Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique. Int J Mol Sci 2022; 23:1251. [PMID: 35163174 PMCID: PMC8836036 DOI: 10.3390/ijms23031251] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 12/24/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/15/2022] Open
Abstract
4mC is a type of DNA alteration that has the ability to synchronize multiple biological movements, for example, DNA replication, gene expressions, and transcriptional regulations. Accurate prediction of 4mC sites can provide exact information to their hereditary functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the anticipated model, two kinds of feature descriptors, namely, binary and k-mer composition were used to encode the DNA sequences of Geobacter pickeringii. The obtained features from their fusion were optimized by using correlation and gradient-boosting decision tree (GBDT)-based algorithm with incremental feature selection (IFS) method. Then, these optimized features were inserted into 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the anticipated model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.
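The two descriptors named in this abstract, binary encoding and k-mer composition, are commonly computed as follows. This is a generic sketch rather than the authors' implementation: it assumes sequences contain only A, C, G and T and are at least k bases long, and the function names are illustrative.

```python
from itertools import product

BASES = "ACGT"


def binary_encode(seq):
    """One-hot encode each base as four bits in A, C, G, T order."""
    encoded = []
    for base in seq:
        encoded.extend(1 if base == b else 0 for b in BASES)
    return encoded


def kmer_composition(seq, k):
    """Return the frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1            # number of overlapping windows
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[kmer] / total for kmer in kmers]


seq = "ACGTAC"                          # hypothetical toy sequence
features = binary_encode(seq) + kmer_composition(seq, 2)
print(len(features))                    # 6 * 4 one-hot bits + 16 dinucleotide frequencies
```

Concatenating the two vectors corresponds to the feature fusion mentioned in the abstract, before any feature selection is applied.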
Affiliation(s)
| | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (H.Z.); (Q.-L.H.); (H.L.); (Z.-J.S.); (F.-Y.D.)
| |
|
44
|
Liu M, Chen H, Gao D, Ma CY, Zhang ZY. Identification of Helicobacter pylori Membrane Proteins Using Sequence-Based Features. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7493834. [PMID: 35069791 PMCID: PMC8769816 DOI: 10.1155/2022/7493834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2021] [Accepted: 12/16/2021] [Indexed: 11/28/2022]
Abstract
Helicobacter pylori (H. pylori) is the most common risk factor for gastric cancer worldwide. The membrane proteins of H. pylori are involved in bacterial adherence and play a vital role in the field of drug discovery. Thus, an accurate and cost-effective computational model is needed to predict the uncharacterized membrane proteins of H. pylori. In this study, a reliable benchmark dataset consisting of 114 membrane and 219 nonmembrane proteins was constructed based on UniProt. A support vector machine (SVM)-based model was developed for discriminating H. pylori membrane proteins from nonmembrane proteins by using sequence information. Cross-validation showed that our method achieved good performance with an accuracy of 91.29%. It is anticipated that the proposed model will be useful for the annotation of H. pylori membrane proteins and the development of new anti-H. pylori agents.
Affiliation(s)
- Mujiexin Liu
- Ineye Hospital of Chengdu University of TCM, Chengdu University of TCM, Chengdu 610084, China
| | - Hui Chen
- School of Healthcare Technology, Chengdu Neusoft University, 611844 Chengdu, China
| | - Dong Gao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Cai-Yi Ma
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- School of Healthcare Technology, Chengdu Neusoft University, 611844 Chengdu, China
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
45
|
Lin C, Wang L, Shi L. AAPred-CNN: accurate predictor based on deep convolution neural network for identification of anti-angiogenic peptides. Methods 2022; 204:442-448. [PMID: 35031486 DOI: 10.1016/j.ymeth.2022.01.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/24/2021] [Revised: 12/28/2021] [Accepted: 01/09/2022] [Indexed: 12/13/2022] Open
Abstract
Recently, deep learning techniques have been developed for various bioactive peptide prediction tasks. However, there are only conventional machine learning-based methods for the prediction of anti-angiogenic peptides (AAP), which play an important role in cancer treatment. The main reason no deep learning method has been applied in this field is that there are too few experimentally validated AAPs to support the training of deep models, and researchers have believed that deep learning depends heavily on the amount of labeled data. In this paper, as a tentative work, we try to predict AAP by constructing different classical deep learning models and propose the first deep convolution neural network-based predictor (AAPred-CNN) for AAP. Contrary to intuition, the experimental results show that deep learning models can achieve superior or comparable performance to the state-of-the-art model, although they are given only a few labeled sequences for training. We also decipher the influence of hyper-parameters and training samples on the performance of deep learning models to help understand how the model works. Furthermore, we visualize the learned embeddings by dimension reduction to increase the model interpretability and reveal the residue propensity of AAP through the statistics of convolutional features for different residues. In summary, this work demonstrates the powerful representation ability of AAPred-CNN for AAP prediction, further improving the prediction accuracy of AAP.
Affiliation(s)
- Changhang Lin
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou, China
| | - Lei Wang
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
|
46
|
Cui F, Li S, Zhang Z, Sui M, Cao C, El-Latif Hesham A, Zou Q. DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins. Comput Struct Biotechnol J 2022; 20:2020-2028. [PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/08/2022] [Revised: 04/06/2022] [Accepted: 04/20/2022] [Indexed: 11/29/2022] Open
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression. Accurate identification of these proteins is crucial. However, there are two existing challenges: one is the problem of ignoring DNA- and RNA-binding proteins (DRBPs), and the other is a cross-predicting problem referring to DBP predictors predicting DBPs as RBPs, and vice versa. In this study, we proposed a computational predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. DBPs, RBPs, DRBPs and non-NABPs as separate classes of data were used for training the DeepMC-iNABP model. The results on test data collected in this study and two independent test datasets showed that DeepMC-iNABP has a strong advantage in identifying the DRBPs and has the ability to alleviate the cross-prediction problem to a certain extent. The web-server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.
Affiliation(s)
- Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Shuang Li
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Miaomiao Sui
- Graduate School Agricultural and Life Science, The University of Tokyo, Tokyo 1138657, Japan
| | - Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China
| | - Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef 62511, Egypt
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Corresponding author at: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
|
47
|
Li H, Gong Y, Liu Y, Lin H, Wang G. Detection of transcription factors binding to methylated DNA by deep recurrent neural network. Brief Bioinform 2021; 23:6484512. [PMID: 34962264 DOI: 10.1093/bib/bbab533] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/22/2021] [Revised: 10/23/2021] [Accepted: 11/19/2021] [Indexed: 12/13/2022] Open
Abstract
Transcription factors (TFs) are proteins specifically involved in gene expression regulation. It is generally accepted in epigenetics that methylated nucleotides could prevent the TFs from binding to DNA fragments. However, recent studies have confirmed that some TFs have the capability to interact with methylated DNA fragments to further regulate gene expression. Although biochemical experiments could recognize TFs binding to methylated DNA sequences, these wet experimental methods are time-consuming and expensive. Machine learning methods provide a good choice for quickly identifying these TFs without experimental materials. Thus, this study aims to design a robust predictor to detect methylated DNA-bound TFs. We first proposed using tripeptide word vector features to formulate protein samples. Subsequently, based on a recurrent neural network with long short-term memory, a two-step computational model was designed. The first-step predictor was utilized to discriminate transcription factors from non-transcription factors. Once proteins were predicted as TFs, the second-step predictor was employed to judge whether the TFs can bind to methylated DNA. In the independent dataset test, the accuracies of the first step and the second step were 86.63% and 73.59%, respectively. In addition, the statistical analysis of the distribution of tripeptides in training samples showed that the position and number of some tripeptides in the sequence could affect the binding of TFs to methylated DNA. Finally, a free web server was established based on the proposed model, which is available at https://bioinfor.nefu.edu.cn/TFPM/.
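The two-step decision flow described in this abstract can be sketched as a simple cascade. This shows only the control flow under stated assumptions: the two predictors are passed in as hypothetical callables standing in for the trained LSTM models, the sequence is assumed to be split into overlapping tripeptides, and the toy rules in the usage lines have no biological meaning.

```python
def tripeptides(seq):
    """Split a protein sequence into overlapping three-residue words."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def two_step_predict(seq, is_tf, binds_methylated):
    """Run step 1 (TF or not); run step 2 only for sequences predicted as TFs."""
    tokens = tripeptides(seq)
    if not is_tf(tokens):
        return "non-TF"
    if binds_methylated(tokens):
        return "TF binding methylated DNA"
    return "TF not binding methylated DNA"


# Hypothetical stand-in predictors; real ones would be trained models.
toy_is_tf = lambda tokens: "KRK" in tokens
toy_binds = lambda tokens: "RKA" in tokens

print(two_step_predict("MKRKA", toy_is_tf, toy_binds))
print(two_step_predict("MAAAA", toy_is_tf, toy_binds))
```

Because the second step only sees proteins that pass the first, errors made in step 1 carry over to the final label, which is worth keeping in mind when reading the two reported accuracies.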
Affiliation(s)
- Hongfei Li
- College of Information and Computer Engineering at Northeast Forestry University of China
| | - Yue Gong
- College of Information and Computer Engineering at Northeast Forestry University of China
| | - Yifeng Liu
- School of management at Henan Institute of Technology of China
| | - Hao Lin
- Center for Informational Biology at University of Electronic Science and Technology of China
| | - Guohua Wang
- College of Information and Computer Engineering at Northeast Forestry University of China
| |
|
48
|
Guo Y, Ju Y, Chen D, Wang L. Research on the Computational Prediction of Essential Genes. Front Cell Dev Biol 2021; 9:803608. [PMID: 34938741 PMCID: PMC8685449 DOI: 10.3389/fcell.2021.803608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Accepted: 11/22/2021] [Indexed: 11/19/2022] Open
Abstract
Genes, the nucleotide sequences that encode a polypeptide chain or functional RNA, are the basic genetic unit controlling biological traits. They are the guarantee of the basic structures and functions in organisms, and they store information related to biological factors and processes such as blood type, gestation, growth, and apoptosis. The environment and genetics jointly affect important physiological processes such as reproduction, cell division, and protein synthesis. Genes are related to a wide range of phenomena including growth, decline, illness, aging, and death. During the evolution of organisms, there is a class of genes that exist in a conserved form in multiple species. These genes are often located on the dominant strand of DNA and tend to have higher expression levels. The protein encoded by it usually either performs very important functions or is responsible for maintaining and repairing these essential functions. Such genes are called persistent genes. Among them, the irreplaceable part of the body’s life activities is the essential gene. For example, when starch is the only source of energy, the genes related to starch digestion are essential genes. Without them, the organism will die because it cannot obtain enough energy to maintain basic functions. The function of the proteins encoded by these genes is thought to be fundamental to life. Nowadays, DNA can be extracted from blood, saliva, or tissue cells for genetic testing, and detailed genetic information can be obtained using the most advanced scientific instruments and technologies. The information gained from genetic testing is useful to assess the potential risks of disease, and to help determine the prognosis and development of diseases. Such information is also useful for developing personalized medication and providing targeted health guidance to improve the quality of life. Therefore, it is of great theoretical and practical significance to identify important and essential genes. 
In this paper, the research status of essential genes and the essential genome database of bacteria are reviewed, the computational prediction method of essential genes based on communication coding theory is expounded, and the significance and practical application value of essential genes are discussed.
Affiliation(s)
- Yuxin Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Lihong Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| |
|
49
|
Gu X, Guo L, Liao B, Jiang Q. Pseudo-188D: Phage Protein Prediction Based on a Model of Pseudo-188D. Front Genet 2021; 12:796327. [PMID: 34925468 PMCID: PMC8672092 DOI: 10.3389/fgene.2021.796327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022] Open
Abstract
Phages strongly affect biochemical systems worldwide. They are related to human health, and medical treatments for many cancers and skin infections involve phages; this paper therefore seeks to identify phage proteins. A model named Pseudo-188D was established. Numerical features of the phage were extracted by PseudoKNC, an appropriate feature vector was selected with the AdaBoost tool, and further features were extracted by 188D. The extracted features were then combined, and the viral proteins of the phage were predicted with a stochastic gradient descent algorithm. The model reached a performance of 93.4853%. To verify the stability of the model, 80% of the downloaded data were randomly selected to train the model, and the remaining 20% were used to verify its robustness.
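The random 80%/20% hold-out split mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `split_80_20`, the fixed seed, and the toy sequences are all hypothetical.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of `samples` and return (train, test) in an 80/20 ratio.

    Hypothetical helper: the abstract only says that 80% of the data were
    chosen at random for training and the rest were held out.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5   # integer arithmetic for the 80% boundary
    return shuffled[:cut], shuffled[cut:]


# Toy (sequence, label) pairs; label 1 stands for "phage protein" (assumed).
toy = [("MKT" + "A" * i, i % 2) for i in range(10)]
train, test = split_80_20(toy)
print(len(train), len(test))
```

Fixing the seed keeps the split repeatable between runs; varying it gives different random partitions, which is one simple way to check how stable a model's score is.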
Affiliation(s)
- Xiaomei Gu
  - Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China; Institute of Yangtze River Delta, University of Electronic Science and Technology of China, Haikou, China; Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China; School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Lina Guo
  - Beidahuang Industry Group General Hospital, Harbin, China
- Bo Liao
  - Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China; Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China; School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Qinghua Jiang
  - Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China; Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China; School of Mathematics and Statistics, Hainan Normal University, Haikou, China
|
50
|
Jia Y, Huang S, Zhang T. KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest. Front Genet 2021; 12:811158. [PMID: 34912382 PMCID: PMC8667860 DOI: 10.3389/fgene.2021.811158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2021] [Accepted: 11/15/2021] [Indexed: 02/04/2023] Open
Abstract
DNA-binding proteins (DBPs) are proteins with a specific DNA-binding domain and are associated with many important molecular biological mechanisms. The rapid development of computational methods has made it possible to predict DBPs on a large scale; however, existing methods do not fully integrate DBP-related features, which leads to imprecise predictions. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy of 81.22% on the independent test dataset PDB186, which is higher than that of existing methods.
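The abstract does not say which PSSM descriptors KK-DBP fuses. As a rough sketch of fusing PSSM features by concatenation, the example below assumes two simple, hypothetical descriptors (per-column means and per-column maxima of a position-specific scoring matrix); the function names and the toy matrix are illustrative only.

```python
def column_means(pssm):
    """Average each PSSM column over all residues (rows)."""
    n_rows = len(pssm)
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) / n_rows for j in range(n_cols)]


def column_maxima(pssm):
    """Take the maximum score in each PSSM column."""
    n_cols = len(pssm[0])
    return [max(row[j] for row in pssm) for j in range(n_cols)]


def fuse(pssm):
    """Fuse the two descriptors by concatenating them into one vector.

    Assumption: concatenation stands in for the paper's fusion step,
    which the abstract does not describe in detail.
    """
    return column_means(pssm) + column_maxima(pssm)


# Toy matrix: 2 residues x 3 columns (a real PSSM has 20 columns per residue).
toy_pssm = [[1, -2, 0],
            [3, 0, -4]]
features = fuse(toy_pssm)
print(features)
```

Because each descriptor has a fixed length regardless of protein length, the fused vector can be passed to a classifier such as a random forest.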
Affiliation(s)
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|