1
Meng C, Pei Y, Bu Y, Liu Q, Li Q, Zou Q, Zhang Y. IIFS2.0: An Improved Incremental Feature Selection Method for Protein Sequence Processing Based on a Caching Strategy. J Mol Biol 2024:168741. [PMID: 39122168 DOI: 10.1016/j.jmb.2024.168741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/08/2024] [Accepted: 08/05/2024] [Indexed: 08/12/2024]
Abstract
The purpose of feature selection in protein sequence recognition problems is to select the optimal feature set and use it as training input for classifiers and discover key sequence features of specific proteins. In the feature selection process, relevant features associated with the target task will be retained, and irrelevant and redundant features will be removed. Therefore, in an ideal state, a feature combination with smaller feature dimensions and higher performance indicators is desired. This paper proposes an algorithm called IIFS2.0 based on the cache elimination strategy, which takes the local optimal combination of cached feature subsets as a breakthrough point. It searches for a new feature combination method through the cache elimination strategy to avoid the drawbacks of human factors and excessive reliance on feature sorting results. We validated and analyzed its effectiveness on the protein dataset, demonstrating that IIFS2.0 significantly reduces the dimensionality of feature combinations while also improving various evaluation indicators. In addition, we provide IIFS2.0 on https://112.124.26.17:8006/ for researchers to use.
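For orientation, the conventional incremental feature selection (IFS) loop that an improved method such as IIFS2.0 starts from can be sketched roughly as follows. This is a minimal sketch of the generic baseline only, not the IIFS2.0 cache elimination algorithm, whose details the abstract does not give. The function names, the toy scoring function, and the feature names `f1` to `f4` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Plain IFS: grow the subset one ranked feature at a time and keep
    the prefix that scores best (ties go to the smaller subset)."""
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        if score > best_score:  # strict '>' prefers the shorter prefix on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score


def toy_evaluate(subset: List[str]) -> float:
    """Hypothetical stand-in for cross-validated accuracy: rewards two
    'useful' features and charges a small cost per feature used."""
    useful = {"f1", "f3"}
    return sum(0.4 for f in subset if f in useful) - 0.05 * len(subset)


# Features are assumed to be pre-sorted by some ranking method.
subset, score = incremental_feature_selection(["f1", "f2", "f3", "f4"], toy_evaluate)
print(subset, round(score, 2))
```

In this toy run the selected prefix still contains `f2`, which contributes nothing under the toy score. That dependence on the ranking order is the kind of limitation the abstract says IIFS2.0 is meant to reduce.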
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Yue Pei
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yongbo Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Qing Liu
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Qun Li
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
- Ying Zhang
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.

2
Rukh G, Akbar S, Rehman G, Alarfaj FK, Zou Q. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC Bioinformatics 2024; 25:256. [PMID: 39098908 PMCID: PMC11298090 DOI: 10.1186/s12859-024-05884-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Accepted: 07/29/2024] [Indexed: 08/06/2024] Open
Abstract
BACKGROUND Antioxidant proteins are involved in several biological processes and can protect DNA and cells from the damage of free radicals. These proteins regulate the body's oxidative stress and perform a significant role in many antioxidant-based drugs. The current in vitro-based medications are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins. METHODS In this model, we proposed an accurate prediction method to discriminate antioxidant proteins, namely StackedEnC-AOP. The training sequences are encoded by incorporating a discrete wavelet transform (DWT) into the evolutionary matrix, decomposing the PSSM-based images via two levels of DWT to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect the structural and sequential descriptors. Then the combined vector of sequential features, evolutionary descriptors, and physicochemical properties is produced to cover the flaws of individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy and maximum relevance (mRMR). The optimal feature vector is trained using a stacking-based ensemble meta-model. RESULTS Our developed StackedEnC-AOP method reported a prediction accuracy of 98.40% and an AUC of 0.99 on the training sequences. To evaluate model validation, the StackedEnC-AOP training model achieved an accuracy of 96.92% and an AUC of 0.98 on an independent set. CONCLUSION Our proposed StackedEnC-AOP strategy performed significantly better than current computational models, with ~5% and ~3% improved accuracy on the training and independent sets, respectively. The efficacy and consistency of our proposed StackedEnC-AOP make it a valuable tool for data scientists, and it can play a key role in academic research and drug design.
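To make the "two levels of DWT" step more concrete, here is a minimal sketch of a two-level decomposition using the Haar wavelet on a single column of scores. The abstract does not state which wavelet family or which coefficients the authors use, so the Haar choice, the function names, and the toy column are assumptions for illustration; this is not the PsePSSM-DWT pipeline itself.

```python
import math
from typing import List, Tuple


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT on an even-length signal.
    Returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def two_level_dwt(signal: List[float]) -> Tuple[List[float], List[float], List[float]]:
    """Apply the Haar DWT twice: the second level decomposes the
    level-1 approximation. Returns (approx2, detail2, detail1)."""
    approx1, detail1 = haar_dwt(signal)
    approx2, detail2 = haar_dwt(approx1)
    return approx2, detail2, detail1


# Hypothetical PSSM column: scores for one amino-acid type along 8 residues.
column = [1.0, 3.0, -2.0, 0.0, 4.0, 4.0, -1.0, 1.0]
approx2, detail2, detail1 = two_level_dwt(column)
print(len(approx2), len(detail2), len(detail1))
```

Each level halves the length of the approximation, so a descriptor built from the level-2 coefficients is shorter than the raw column while the orthonormal transform preserves the signal's total energy.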
Affiliation(s)
- Gul Rukh
  - Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Gauhar Rehman
  - Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Fawaz Khaled Alarfaj
  - Department of Management Information Systems (MIS), School of Business, King Faisal University (KFU), 31982, Al-Ahsa, Saudi Arabia
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China.
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China.

3
Chen W, Zhang Y, Wu W, Yang H, Huang W. Machine learning-based predictive model for abdominal diseases using physical examination datasets. Comput Biol Med 2024; 173:108249. [PMID: 38531251 DOI: 10.1016/j.compbiomed.2024.108249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 02/21/2024] [Accepted: 03/06/2024] [Indexed: 03/28/2024]
Abstract
Abdominal ultrasound is a key non-invasive imaging method for diagnosing liver, kidney, and gallbladder diseases. Despite its clinical significance, not all individuals can undergo abdominal ultrasonography during routine health check-ups due to limitations in equipment, cost, and time. This study aims to use basic physical examination data to predict the risk of diseases of the liver, kidney, and gallbladder that can be diagnosed via abdominal ultrasound. Using basic physical examination data comprising gender, age, height, weight, BMI, pulse, systolic blood pressure (SBP), diastolic blood pressure (DBP), high-density lipoprotein (HDL), low-density lipoprotein (LDL), total cholesterol, triglycerides, fasting blood glucose (FBG), and uric acid, we established seven single-label predictive models and one multi-label predictive model. These models were specifically designed to predict a range of abdominal diseases. The single-label models, utilizing the XGBoost algorithm, targeted diseases such as fatty liver (with an Area Under the Curve (AUC) of 0.9344), liver deposits (AUC: 0.8221), liver cysts (AUC: 0.7928), gallbladder polyps (AUC: 0.7508), kidney stones (AUC: 0.7853), kidney cysts (AUC: 0.8241), and kidney crystals (AUC: 0.7536). Furthermore, a comprehensive multi-label model, capable of predicting multiple conditions simultaneously, was established by FCN and achieved an AUC of 0.6344. We conducted interpretability analysis on these models to enhance their understanding and applicability in clinical settings. The insights gained from this analysis are crucial for the development of targeted disease prevention strategies. This study represents a significant advancement in utilizing physical examination data to predict ultrasound results, offering a novel approach to early diagnosis and prevention of abdominal diseases.
Affiliation(s)
- Wei Chen
  - Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China; Four Provincial Marginal Traditional Chinese Medicine Hospitals (Quzhou Traditional Chinese Medicine Hospital) Affiliated to Zhejiang University of Traditional Chinese Medicine, Quzhou, China
- YuJie Zhang
  - Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China
- Weili Wu
  - Four Provincial Marginal Traditional Chinese Medicine Hospitals (Quzhou Traditional Chinese Medicine Hospital) Affiliated to Zhejiang University of Traditional Chinese Medicine, Quzhou, China
- Hui Yang
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China.
- Wenxiu Huang
  - Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China.

4
Zhang ZY, Zhang Z, Ye X, Sakurai T, Lin H. A BERT-based model for the prediction of lncRNA subcellular localization in Homo sapiens. Int J Biol Macromol 2024; 265:130659. [PMID: 38462114 DOI: 10.1016/j.ijbiomac.2024.130659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 02/19/2024] [Accepted: 03/04/2024] [Indexed: 03/12/2024]
Abstract
Understanding the subcellular localization of lncRNAs is crucial for comprehending their regulatory activities. The conventional detection of lncRNA subcellular location usually uses in situ detection techniques, which are resource intensive. Some machine learning-based algorithms have been proposed for lncRNA subcellular location prediction in mammals. However, due to the low level of conservation of lncRNA sequences, the performance of cross-species models remains unsatisfactory. In this study, we curated a novel dataset containing subcellular location information of lncRNAs in Homo sapiens. Subsequently, based on the BERT pre-trained language algorithm, we developed a model for lncRNA subcellular location prediction. Our model achieved a micro-average area under the receiver operating characteristic curve (AUROC) of 0.791 on the training set and an AUROC of 0.700 on the testing nucleus set. Additionally, we conducted cross-species validation and motif discovery to further investigate underlying patterns. In summary, our study provides valuable guidance and computational analysis tools for exploring the mechanisms of lncRNA subcellular localization and the dynamic spatial changes of RNA in abnormal physiological states.
Affiliation(s)
- Zhao-Yue Zhang
  - Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba 3058577, Japan
- Zheng Zhang
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan.
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Hao Lin
  - Center for Information Biology, University of Electronic Science and Technology of China, Chengdu 611731, China.

5
Zhang ZY, Sun ZJ, Gao D, Hao YD, Lin H, Liu F. Excavation of gene markers associated with pancreatic ductal adenocarcinoma based on interrelationships of gene expression. IET Syst Biol 2024. [PMID: 38530028 DOI: 10.1049/syb2.12090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 02/06/2024] [Accepted: 03/10/2024] [Indexed: 03/27/2024] Open
Abstract
Pancreatic ductal adenocarcinoma (PDAC) accounts for 95% of all pancreatic cancer cases, posing grave challenges to its diagnosis and treatment. Timely diagnosis is pivotal for improving patient survival, necessitating the discovery of precise biomarkers. An innovative approach was introduced to identify gene markers for precision PDAC detection. The core idea of our method is to discover gene pairs that display consistent opposite relative expression and differential co-expression patterns between PDAC and normal samples. Reversal gene pair analysis and differential partial correlation analysis were performed to determine reversal differential partial correlation (RDC) gene pairs. Using incremental feature selection, the authors refined the selected gene set and constructed a machine-learning model for PDAC recognition. As a result, the approach identified 10 RDC gene pairs, and the model achieved a remarkable accuracy of 96.1% during cross-validation, surpassing gene expression-based models. The experiment on independent validation data confirmed the model's performance. Enrichment analysis revealed the involvement of these genes in essential biological processes and shed light on their potential roles in PDAC pathogenesis. Overall, the findings highlight the potential of these 10 RDC gene pairs as effective diagnostic markers for early PDAC detection, bringing hope for improving patient prognosis and survival.
Affiliation(s)
- Zhao-Yue Zhang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
- Zi-Jie Sun
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Dong Gao
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Yu-Duo Hao
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Fen Liu
  - Department of Radiation Oncology, Peking University Cancer Hospital (Inner Mongolia Campus), Affiliated Cancer Hospital of Inner Mongolia Medical University, Inner Mongolia Cancer Hospital, Hohhot, China

6
Fu X, Yuan Y, Qiu H, Suo H, Song Y, Li A, Zhang Y, Xiao C, Li Y, Dou L, Zhang Z, Cui F. AGF-PPIS: A protein-protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods 2024; 222:142-151. [PMID: 38242383 DOI: 10.1016/j.ymeth.2024.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/21/2023] [Revised: 01/04/2024] [Accepted: 01/13/2024] [Indexed: 01/21/2024] Open
Abstract
Protein-protein interactions play an important role in various biological processes. Interaction among proteins has a wide range of applications. Therefore, the correct identification of protein-protein interaction sites is crucial. In this paper, we propose a novel predictor for protein-protein interaction sites, AGF-PPIS, where we utilize a multi-head self-attention mechanism (introducing a graph structure), a graph convolutional network, and a feed-forward neural network. We use the Euclidean distance between each pair of protein residues to generate the corresponding protein graph as the input of AGF-PPIS. On the independent test dataset Test_60, AGF-PPIS achieves superior performance over comparative methods in terms of seven different evaluation metrics (ACC, precision, recall, F1-score, MCC, AUROC, AUPRC), which demonstrates the validity and superiority of the proposed AGF-PPIS model. The source code and the steps for usage of AGF-PPIS are available at https://github.com/fxh1001/AGF-PPIS.
Affiliation(s)
- Xiuhao Fu
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Ye Yuan
  - Beidahuang Industry Group General Hospital, Harbin 150001, China
- Haoye Qiu
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Haodong Suo
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yingying Song
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Anqi Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yupeng Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Cuilin Xiao
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yazi Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China.
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China.

7
Ye Y, Li M, Pan Q, Fang X, Yang H, Dong B, Yang J, Zheng Y, Zhang R, Liao Z. Machine learning-based classification of deubiquitinase USP26 and its cell proliferation inhibition through stabilizing KLF6 in cervical cancer. Comput Biol Med 2024; 168:107745. [PMID: 38064851 DOI: 10.1016/j.compbiomed.2023.107745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Revised: 10/31/2023] [Accepted: 11/20/2023] [Indexed: 01/10/2024]
Abstract
OBJECTIVE We aim to accurately distinguish ubiquitin-specific proteases (USPs) from other members within the deubiquitinating enzyme families based on protein sequences. Additionally, we seek to elucidate the specific regulatory mechanisms through which USP26 modulates Krüppel-like factor 6 (KLF6) and assess the subsequent effects of this regulation on both the proliferation and migration of cervical cancer cells. METHODS All the deubiquitinase (DUB) sequences were classified into USPs and non-USPs. Feature vectors, including 188D, n-gram, and 400D dimensions, were extracted from these sequences and subjected to binary classification via the Weka software. Next, thirty human USPs were also analyzed to identify conserved motifs and ascertain evolutionary relationships. Experimentally, more than 90 unique DUB-encoding plasmids were transfected into HeLa cell lines to assess alterations in KLF6 protein levels and to isolate a specific DUB involved in KLF6 regulation. Subsequent experiments utilized both wild-type (WT) USP26 overexpression and shRNA-mediated USP26 knockdown to examine changes in KLF6 protein levels. The half-life experiment was performed to assess the influence of USP26 on KLF6 protein stability. Immunoprecipitation was applied to confirm the USP26-KLF6 interaction, and ubiquitination assays to explore the role of USP26 in KLF6 deubiquitination. Additional cellular assays were conducted to evaluate the effects of USP26 on HeLa cell proliferation and migration. RESULTS 1. Among the extracted feature vectors of 188D, 400D, and n-gram, all 12 classifiers demonstrated excellent performance. The RandomForest classifier demonstrated superior performance in this assessment. Phylogenetic analysis of 30 human USPs revealed the presence of nine unique motifs, comprising zinc finger and ubiquitin-specific protease domains. 2. Through a systematic screening of the deubiquitinase library, USP26 was identified as the sole DUB associated with KLF6. 3. USP26 positively regulated the protein level of KLF6, as evidenced by the decrease in KLF6 protein expression upon shUSP26 knockdown in both 293T and HeLa cell lines. Additionally, half-life experiments demonstrated that USP26 prolonged the stability of KLF6. 4. Immunoprecipitation experiments revealed a strong interaction between USP26 and KLF6. Notably, the functional interaction domain was mapped to amino acids 285-913 of USP26, as opposed to the 1-295 region. 5. WT USP26 was found to attenuate the ubiquitination levels of KLF6. However, the mutant USP26 abrogated its deubiquitination activity. 6. Functional biological assays demonstrated that overexpression of USP26 inhibited both proliferation and migration of HeLa cells. Conversely, knockdown of USP26 was shown to promote these oncogenic properties. CONCLUSIONS 1. At the protein sequence level, members of the USP family can be effectively differentiated from non-USP proteins. Furthermore, specific functional motifs have been identified within the sequences of human USPs. 2. The deubiquitinating enzyme USP26 has been shown to target KLF6 for deubiquitination, thereby modulating its stability. Importantly, USP26 plays a pivotal role in the modulation of proliferation and migration in cervical cancer cells.
Affiliation(s)
- Ying Ye
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Meng Li
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Qilong Pan
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Xin Fang
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China; Laboratory of Non-communicable Chronic Disease Control, Fujian Provincial Center for Disease Control and Prevention, Fuzhou, 350012, China
- Hong Yang
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Bingying Dong
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Jiaying Yang
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Yuan Zheng
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Renxiang Zhang
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Zhijun Liao
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China.

8
Ma Y, Pei Y, Li C. Predictive Recognition of DNA-binding Proteins Based on Pre-trained Language Model BERT. J Bioinform Comput Biol 2023; 21:2350028. [PMID: 38248912 DOI: 10.1142/s0219720023500282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/23/2024]
Abstract
Identifying proteins is crucial for disease diagnosis and treatment. With the increase of known proteins, large-scale batch predictions are essential. However, traditional biological experiments are time-consuming and expensive, which makes it difficult to accomplish this task efficiently. Deep learning algorithms based on big data analysis have shown potential in this respect. In recent years, language representation models, especially BERT, have made significant advancements in natural language processing. In this paper, using three protein segmentation methods and three encoder numbers, nine BERT models with different sizes are constructed to predict whether known proteins are DNA-binding proteins or not. Furthermore, based on the concept of protein motifs, multi-scale convolutional networks are fused into the models to extract the local features of DNA-binding proteins. Finally, we find that when each amino acid in the protein is treated as a word, the larger the number of encoders, the better the model predictions. Our proposed algorithm achieves 81.88% sensitivity and 0.39 MCC value on the test set. Furthermore, it achieves 62.41% accuracy on the independent test set PDB2272. It is evident that our proposed method can be a tool to assist in the identification of DNA-binding proteins.
Affiliation(s)
- Yue Ma
  - School of Computer Science and Technology, Tiangong University, Tianjin, P. R. China
- Yongzhen Pei
  - School of Mathematical Sciences, Tiangong University, Tianjin, P. R. China
- Changguo Li
  - Department of Basic Science, Army Military Transportation University, Tianjin, P. R. China

9
Meng C, Pei Y, Bu Y, Zou Q, Ju Y. Machine learning-based antioxidant protein identification model: Progress and evaluation. J Cell Biochem 2023; 124:1825-1834. [PMID: 37877550 DOI: 10.1002/jcb.30491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/30/2023] [Accepted: 10/06/2023] [Indexed: 10/26/2023]
Abstract
Efficient and accurate identification of antioxidant proteins is of great significance. In recent years, many models for identifying antioxidant proteins have been proposed, but the low sensitivity and high dimensionality of the models are common problems. The generalization ability of the model needs to be improved. Researchers have tried different feature extraction algorithms and feature selection algorithms to obtain the most effective feature combination and have chosen more appropriate classification algorithms and tools to improve model performance. In this article, we systematically reviewed the data set of the most frequently used antioxidant proteins and the method selection for each step of model establishment and discussed the characteristics of each method. We have conducted a detailed analysis of recent research and believe that the practical ability and efficiency of model application can be improved by reducing model dimensions. The key to improving the performance of antioxidant protein recognition models in the future may lie in feature selection, so this paper also focuses on the combination of feature extraction and selection steps in the analysis of the model building process.
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot, China
- Yue Pei
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Yongbo Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Ying Ju
  - School of Informatics, Xiamen University, Xiamen, China

10
Zhang Y, Liu P, Tang LJ, Lin PM, Li R, Luo HR, Luo P. Basing on the machine learning model to analyse the coronary calcification score and the coronary flow reserve score to evaluate the degree of coronary artery stenosis. Comput Biol Med 2023; 163:107130. [PMID: 37329614 DOI: 10.1016/j.compbiomed.2023.107130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2023] [Revised: 05/23/2023] [Accepted: 06/01/2023] [Indexed: 06/19/2023]
Abstract
AIM To obtain the coronary artery calcium score (CACS) for each branch in coronary artery computed tomography angiography (CCTA) examination, combined with the fractional flow reserve (FFR) of each branch in the coronary artery detected by CT, and to apply a machine learning (ML) model to analyse and predict the severity of coronary artery stenosis. METHODS All patients who underwent coronary computed tomography angiography (CCTA) from January 2019 to April 2022 at the HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY were retrospectively screened, and their sex, age, characteristics of lipid-containing lesions, coronary calcium score (CACS) and CT-FFR values were collected. Five machine learning models, random forest (RF), k-nearest neighbour algorithm (KNN), kernel logistic regression, support vector machine (SVM) and radial basis function neural network (RBFNN), were used as predictive models to evaluate the severity of coronary stenosis. RESULTS Among the five machine learning models, the SVM model achieved the best prediction performance, and the prediction accuracy of mild stenosis was up to 90%. Second, age and male sex were important influencing factors of increasing CACS and decreasing CT-FFR. Moreover, the critical CACS value for myocardial ischemia was calculated to be >200.70. CONCLUSION Through machine learning model analysis, we demonstrate the importance of CACS and FFR in predicting coronary stenosis, in particular with the support vector machine model, which promotes the application of artificial intelligence and machine learning methods in the field of medical analysis.
Affiliation(s)
- Ying Zhang
  - State Key Laboratories for Quality Research in Chinese Medicines, Faculty of Pharmacy, Macau University of Science and Technology, Macau; Department of Anaesthesiology, HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY, Lu Zhou, (646000), Sichuan, China.
- Ping Liu
  - Department of Anaesthesiology, HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY, Lu Zhou, (646000), Sichuan, China.
- Li-Jia Tang
  - Department of Anaesthesiology, HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY, Lu Zhou, (646000), Sichuan, China.
- Pei-Min Lin
  - Department of Anaesthesiology, HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY, Lu Zhou, (646000), Sichuan, China.
- Run Li
  - Department of Anaesthesiology, HOSPITAL (T.C.M) AFFILIATED TO SOUTHWEST MEDICAL UNIVERSITY, Lu Zhou, (646000), Sichuan, China.
- Huai-Rong Luo
  - State Key Laboratories for Quality Research in Chinese Medicines, Faculty of Pharmacy, Macau University of Science and Technology, Macau.
- Pei Luo
  - State Key Laboratories for Quality Research in Chinese Medicines, Faculty of Pharmacy, Macau University of Science and Technology, Macau.

11
Meng C, Pei Y, Zou Q, Yuan L. DP-AOP: A novel SVM-based antioxidant proteins identifier. Int J Biol Macromol 2023; 247:125499. [PMID: 37414318 DOI: 10.1016/j.ijbiomac.2023.125499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 06/01/2023] [Accepted: 06/19/2023] [Indexed: 07/08/2023]
Abstract
The identification of antioxidant proteins is a challenging yet meaningful task, as they can protect against the damage caused by some free radicals. Because experimental identification methods are time-consuming, laborious, and expensive, efficient identification of antioxidant proteins through machine learning algorithms has become increasingly common. In recent years, researchers have proposed models for identifying antioxidant proteins; unfortunately, although the accuracy of these models is already high, their sensitivity is too low, indicating possible overfitting. Therefore, we developed a new model called DP-AOP for the recognition of antioxidant proteins. We used the SMOTE algorithm to balance the dataset, selected Wei's proposed feature extraction algorithm to obtain 473-dimensional feature vectors, and used the sorting function in MRMD to score and rank each feature, obtaining a feature set ordered by contribution value from high to low. To effectively reduce the feature dimension, we applied the idea of dynamic programming to make the local eight features the optimal subset. After obtaining the 36-dimensional feature vectors, we finally selected 17 features through experimental analysis. The SVM classification algorithm was implemented with the libsvm tool. The model achieved satisfactory performance, with an accuracy of 91.076%, SN of 96.4%, SP of 85.8%, MCC of 82.6%, and F1 score of 91.5%. Furthermore, we built a free web server to facilitate researchers' subsequent studies of antioxidant protein recognition. The website is http://112.124.26.17:8003/#/.
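For readers unfamiliar with the indicators quoted in this abstract (accuracy, SN, SP, MCC, F1), the following is a minimal sketch of their standard definitions computed from a binary confusion matrix. It is not taken from the DP-AOP paper: the function name `binary_metrics` and the example counts are hypothetical, and the values are returned as fractions rather than percentages.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification indicators from confusion-matrix counts.

    tp/fn/tn/fp are hypothetical counts supplied by the caller; this sketch
    assumes each class has at least one sample and at least one prediction.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc, "F1": f1}


# Illustrative counts only (not data from the paper).
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A high accuracy paired with a low SN, the situation the abstract criticizes in earlier models, shows up here as a large `tn` count masking a large `fn` count, which is why the indicators are usually reported together.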
Affiliation(s)
- Chaolu Meng
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China.
- Yue Pei
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China.
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China.
- Lei Yuan
- Department of Hepatobiliary Surgery, Quzhou People's Hospital, China.
|
12
|
Ju H, Bai J, Jiang J, Che Y, Chen X. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning. Front Genet 2023; 14:1254827. [PMID: 37671040 PMCID: PMC10476523 DOI: 10.3389/fgene.2023.1254827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
DNA N4-methylcytosine (4mC) is significantly involved in biological processes, such as DNA expression, repair, and replication. Therefore, accurate prediction methods are urgently needed. Deep learning methods have transformed applications that previously required sequencing expertise into engineering challenges that do not require such expertise to solve. Here, we compare a variety of state-of-the-art deep learning models on six benchmark datasets to evaluate their performance in 4mC methylation site detection. We visualize the statistical analysis of the datasets and the performance of the different deep learning models. We conclude that deep learning can greatly expand the potential of methylation site prediction.
Affiliation(s)
- Hong Ju
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
- Jie Bai
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Hangzhou, China
- Jing Jiang
- Beidahuang Industry Group General Hospital, Harbin, China
- Yusheng Che
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
- Xin Chen
- Department of Neurosurgical Laboratory, The First Affiliated Hospital of Harbin Medical University, Harbin, China
|
13
|
Su W, Qian X, Yang K, Ding H, Huang C, Zhang Z. Recognition of outer membrane proteins using multiple feature fusion. Front Genet 2023; 14:1211020. [PMID: 37351347 PMCID: PMC10284346 DOI: 10.3389/fgene.2023.1211020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 05/24/2023] [Indexed: 06/24/2023] Open
Abstract
Introduction: Outer membrane proteins (OMPs) are crucial in maintaining the structural stability and permeability of the outer membrane. They exhibit several properties, such as antigenicity and strong immunogenicity, which have potential applications in clinical diagnosis and disease prevention. However, wet experiments for studying OMPs are time- and capital-intensive, thereby necessitating the use of computational methods for their identification. Methods: In this study, we developed a computational model to predict outer membrane proteins. The non-redundant dataset consists of a positive set of 208 outer membrane proteins and a negative set of 876 non-outer membrane proteins. We employed the pseudo amino acid composition method to extract feature vectors and subsequently used a support vector machine for prediction. Results and Discussion: In Jackknife cross-validation, the overall accuracy and the area under the receiver operating characteristic curve were 93.19% and 0.966, respectively. These results demonstrate that our model can produce accurate predictions and could serve as a valuable guide for experimental research on outer membrane proteins.
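The Jackknife cross-validation mentioned in this abstract is the leave-one-out procedure: each sample is held out once, the model is trained on all remaining samples, and the held-out sample is predicted. The sketch below illustrates only that loop. As an explicit assumption, a simple nearest-centroid classifier stands in for the support vector machine used in the paper, and the function names and toy feature vectors are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the paper's SVM): label of the closest class centroid."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        total = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            total[j] += value
        counts[label] = counts.get(label, 0) + 1

    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [t / counts[label] for t in total]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(X, y, predict):
    """Leave-one-out accuracy: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy, well-separated feature vectors (illustrative only).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = [1, 1, 1, 0, 0, 0]
print(jackknife_accuracy(X, y, nearest_centroid_predict))
```

Because every sample requires a separate training run, the procedure is deterministic but costly; on a dataset of about a thousand proteins, as described in the abstract, it means roughly a thousand model fits.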
Affiliation(s)
- Wenxia Su
- College of Science, Inner Mongolia Agriculture University, Hohhot, China
- Xiaojun Qian
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- Keli Yang
- Nonlinear Research Institute, Baoji University of Arts and Sciences, Baoji, China
- Hui Ding
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- Chengbing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba, China
- Zhaoyue Zhang
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
|
14
|
Li Y, Ma D, Chen D, Chen Y. ACP-GBDT: An improved anticancer peptide identification method with gradient boosting decision tree. Front Genet 2023; 14:1165765. [PMID: 37065496 PMCID: PMC10090421 DOI: 10.3389/fgene.2023.1165765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 03/09/2023] [Indexed: 03/31/2023] Open
Abstract
Cancer is one of the most dangerous diseases in the world, killing millions of people every year. In recent years, drugs composed of anticancer peptides have been used to treat cancer with low side effects. Therefore, identifying anticancer peptides has become a focus of research. In this study, an improved anticancer peptide predictor named ACP-GBDT, based on gradient boosting decision tree (GBDT) and sequence information, is proposed. To encode the peptide sequences in the anticancer peptide dataset, ACP-GBDT uses a merged feature composed of AAIndex and SVMProt-188D. A GBDT is adopted to train the prediction model in ACP-GBDT. Independent testing and ten-fold cross-validation show that ACP-GBDT can effectively distinguish anticancer peptides from non-anticancer ones. Comparison results on the benchmark dataset show that ACP-GBDT is simpler and more effective than other existing anticancer peptide prediction methods.
Affiliation(s)
- Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- Di Ma
- College of Computer, Hangzhou Dianzi University, Hangzhou, China
- Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- *Correspondence: Dong Chen; Yu Chen
- Yu Chen
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Dong Chen; Yu Chen
|
15
|
Wang X, Ding Z, Wang R, Lin X. Deepro-Glu: combination of convolutional neural network and Bi-LSTM models using ProtBert and handcrafted features to identify lysine glutarylation sites. Brief Bioinform 2023; 24:6991122. [PMID: 36653898 DOI: 10.1093/bib/bbac631] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/23/2022] [Revised: 12/11/2022] [Accepted: 12/28/2022] [Indexed: 01/20/2023] Open
Abstract
Lysine glutarylation (Kglu) is a newly discovered post-translational modification of proteins with important roles in mitochondrial functions, oxidative damage, etc. The established biological experimental methods to identify glutarylation sites are often time-consuming and costly. Therefore, there is an urgent need to develop computational methods for efficient and accurate identification of glutarylation sites. Most existing computational methods utilize only handcrafted features to construct the prediction model and do not consider the positive impact of pre-trained protein language models on prediction performance. Based on this, we develop an ensemble deep-learning predictor, Deepro-Glu, that combines a convolutional neural network and a bidirectional long short-term memory network, using both deep learning features and traditional handcrafted features to predict lysine glutarylation sites. The deep learning features are generated from the pre-trained protein language model ProtBert, and the handcrafted features consist of sequence-based features, physicochemical property-based features and evolution information-based features. Furthermore, an attention mechanism is used to efficiently integrate the deep learning features and the handcrafted features by learning the appropriate attention weights. 10-fold cross-validation and independent tests demonstrate that Deepro-Glu achieves performance competitive with or superior to the state-of-the-art methods. The source codes and data are publicly available at https://github.com/xwanggroup/Deepro-Glu.
Affiliation(s)
- Xiao Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
- Zhaoyuan Ding
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
- Rong Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
- Xi Lin
- Institute of Artificial Intelligence, Xiamen University, No. 4221, Xiang'an South Road, 361000, Xiamen, China
|
16
|
Su W, Deng S, Gu Z, Yang K, Ding H, Chen H, Zhang Z. Prediction of apoptosis protein subcellular location based on amphiphilic pseudo amino acid composition. Front Genet 2023; 14:1157021. [PMID: 36926588 PMCID: PMC10011625 DOI: 10.3389/fgene.2023.1157021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 02/20/2023] [Indexed: 03/08/2023] Open
Abstract
Introduction: Apoptosis proteins play an important role in the process of cell apoptosis, which keeps the rates of cell proliferation and death in relative balance. Because the function of an apoptosis protein is closely related to its subcellular location, it is of great significance to study the subcellular locations of apoptosis proteins. Many efforts in bioinformatics research have been aimed at predicting their subcellular location. However, the subcellular localization of apoptosis proteins still needs to be carefully studied. Methods: In this paper, based on amphiphilic pseudo amino acid composition and the support vector machine algorithm, a new method was proposed for the prediction of apoptosis proteins' subcellular location. Results and Discussion: The method achieved good performance on three data sets. The Jackknife test accuracy on the three data sets reached 90.5%, 93.9% and 84.0%, respectively. Compared with previous methods, the prediction accuracies of APACC_SVM were improved.
Affiliation(s)
- Wenxia Su
- College of Science, Inner Mongolia Agriculture University, Hohhot, China
- Shuyi Deng
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zhifeng Gu
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- Keli Yang
- Nonlinear Research Institute, Baoji University of Arts and Sciences, Baoji, China
- Hui Ding
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hui Chen
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
- Zhaoyue Zhang
- School of Life Science and Technology, Center for Information Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
|
17
|
Zhang S, Yao Y, Wang J, Liang Y. Identification of DNA N4-methylcytosine sites based on multi-source features and gradient boosting decision tree. Anal Biochem 2022; 652:114746. [DOI: 10.1016/j.ab.2022.114746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 05/13/2022] [Accepted: 05/18/2022] [Indexed: 11/16/2022]
|