1
Bałdysz S, Nawrot R, Barylski J. "Tear down that wall"-a critical evaluation of bioinformatic resources available for lysin researchers. Appl Environ Microbiol 2024; 90:e0236123. [PMID: 38842338 PMCID: PMC11267937 DOI: 10.1128/aem.02361-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2024]
Abstract
Lytic enzymes, or lysins for short, break down peptidoglycan and interrupt the continuity of the cell wall, which, in turn, causes osmotic lysis of the bacterium. Their ability to destroy bacteria from within makes them promising antimicrobial agents that can be used as alternatives or supplements to antibiotics. In this paper, we briefly summarize basic terms and concepts used to describe lysin sequences and delineate major lysin groups. More importantly, we describe the domain repertoire found in lysins and critically review bioinformatic tools or databases which are used in studies of these enzymes (with particular emphasis on the repositories of Hidden Markov models). Finally, we present a novel comprehensive, meticulously curated set of lysin-related family and domain models, sort them into clusters that reflect major families, and demonstrate that the selected models can be used to efficiently search for new lysins.
Affiliation(s)
- Sophia Bałdysz
- Department of Molecular Virology, Institute of Experimental Biology, Adam Mickiewicz University, Poznań, Poland
- Robert Nawrot
- Department of Molecular Virology, Institute of Experimental Biology, Adam Mickiewicz University, Poznań, Poland
- Jakub Barylski
- Department of Molecular Virology, Institute of Experimental Biology, Adam Mickiewicz University, Poznań, Poland
2
Lin L, Long Y, Liu J, Deng D, Yuan Y, Liu L, Tan B, Qi H. FRP-XGBoost: Identification of ferroptosis-related proteins based on multi-view features. Int J Biol Macromol 2024; 262:130180. [PMID: 38360239 DOI: 10.1016/j.ijbiomac.2024.130180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 02/11/2024] [Accepted: 02/12/2024] [Indexed: 02/17/2024]
Abstract
Ferroptosis represents a novel form of programmed cell death. Pan-cancer bioinformatics analysis indicates that identifying and modulating ferroptosis offer innovative approaches for preventing and treating diverse tumor pathologies. However, the precise detection of ferroptosis-related proteins via conventional wet-laboratory techniques remains a formidable challenge, largely due to the constraints of existing methodologies. These traditional approaches are not only labor-intensive but also costly. Consequently, more efficient computational tools are needed to facilitate the detection of these proteins. In this paper, we present FRP-XGBoost, a machine learning method for predicting ferroptosis-related proteins based on XGBoost and multi-view features. We explored four types of protein feature extraction methods and evaluated their effectiveness in predicting ferroptosis-related proteins using six commonly used traditional classifiers. To enhance the representational power of the hybrid features, we employed a two-step feature selection technique to identify the optimal subset of features. Subsequently, we constructed a prediction model using the XGBoost algorithm. FRP-XGBoost achieved an accuracy of 96.74% in 10-fold cross-validation and 91.52% in an independent test. The implementation source code of FRP-XGBoost is available at https://github.com/linli5417/FRP-XGBoost.
Affiliation(s)
- Li Lin
- Department of Obstetrics and Gynecology, Women and Children's Hospital of Chongqing Medical University, Chongqing 401147, China; Department of Obstetrics and Gynecology, Chongqing Health Center for Women and Children, Chongqing 401147, China
- Yao Long
- Chongqing Key Laboratory of Maternal and Fetal Medicine, Chongqing Medical University, Chongqing 400016, China; Joint International Research Laboratory of Reproduction and Development, Chinese Ministry of Education, Chongqing Medical University, 400016, China; Department of Obstetrics, The First Affiliated Hospital of Chongqing Medical University, Chongqing 400016, China
- Jinkai Liu
- Chongqing Key Laboratory of Maternal and Fetal Medicine, Chongqing Medical University, Chongqing 400016, China; Joint International Research Laboratory of Reproduction and Development, Chinese Ministry of Education, Chongqing Medical University, 400016, China; Department of Obstetrics, The First Affiliated Hospital of Chongqing Medical University, Chongqing 400016, China
- Dongliang Deng
- Department of Oncology, Chongqing Traditional Chinese Medicine Hospital, Chongqing 400021, China
- Yu Yuan
- Department of Obstetrics and Gynecology, Women and Children's Hospital of Chongqing Medical University, Chongqing 401147, China; Department of Obstetrics and Gynecology, Chongqing Health Center for Women and Children, Chongqing 401147, China
- Lubin Liu
- Department of Obstetrics and Gynecology, Women and Children's Hospital of Chongqing Medical University, Chongqing 401147, China; Department of Obstetrics and Gynecology, Chongqing Health Center for Women and Children, Chongqing 401147, China
- Bin Tan
- Chongqing Key Laboratory of Maternal and Fetal Medicine, Chongqing Medical University, Chongqing 400016, China; Joint International Research Laboratory of Reproduction and Development, Chinese Ministry of Education, Chongqing Medical University, 400016, China; Department of Obstetrics, The First Affiliated Hospital of Chongqing Medical University, Chongqing 400016, China.
- Hongbo Qi
- Department of Obstetrics and Gynecology, Women and Children's Hospital of Chongqing Medical University, Chongqing 401147, China; Department of Obstetrics and Gynecology, Chongqing Health Center for Women and Children, Chongqing 401147, China; Chongqing Key Laboratory of Maternal and Fetal Medicine, Chongqing Medical University, Chongqing 400016, China; Joint International Research Laboratory of Reproduction and Development, Chinese Ministry of Education, Chongqing Medical University, 400016, China.
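The 10-fold cross-validation reported for FRP-XGBoost in the entry above can be illustrated with a minimal sketch. It assumes a plain shuffled, non-stratified split, and the helper name `k_fold_indices` is hypothetical; it is not taken from the authors' repository, and the abstract does not say how the folds were built.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: samples are shuffled once and dealt round-robin into
    k folds, so fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training set is the union of all other folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
        print(fold_no, len(train), len(test))
```

Each sample lands in exactly one test fold, so the cross-validated accuracy is an average over predictions made on data the model did not see during that fold's training.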
3
Sun A, Li H, Dong G, Zhao Y, Zhang D. DBPboost: A method of classification of DNA-binding proteins based on improved differential evolution algorithm and feature extraction. Methods 2024; 223:56-64. [PMID: 38237792 DOI: 10.1016/j.ymeth.2024.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/29/2023] [Accepted: 01/13/2024] [Indexed: 02/01/2024]
Abstract
DNA-binding proteins are a class of proteins that can interact with DNA molecules through physical and chemical interactions. Their main functions include regulating gene expression, maintaining chromosome structure and stability, and more. DNA-binding proteins play a crucial role in cellular and molecular biology, as they are essential for maintaining normal cellular physiological functions and adapting to environmental changes. The prediction of DNA-binding proteins has been a hot topic in the field of bioinformatics. The key to accurately classifying DNA-binding proteins is to find suitable feature sources and explore the information they contain. Although there are already many models for predicting DNA-binding proteins, there is still room for improvement in mining feature source information and calculation methods. In this study, we created a model called DBPboost to better identify DNA-binding proteins. The innovation of this study lies in the use of eight feature extraction methods, the improvement of the feature selection step, which involves selecting some features first and then performing feature selection again after feature fusion, and the optimization of the differential evolution algorithm in feature fusion, which improves the performance of feature fusion. The experimental results show that the prediction accuracy of the model on the UniSwiss dataset is 89.32%, and the sensitivity is 89.01%, which is better than most existing models.
Affiliation(s)
- Ailun Sun
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Hongfei Li
- College of Life Science, Northeast Forestry University, Harbin 150040, China
- Guanghui Dong
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Yuming Zhao
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dandan Zhang
- Department of Obstetrics and Gynecology, the First Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China.
4
Yu S, Liao B, Zhu W, Peng D, Wu F. Accurate prediction and key protein sequence feature identification of cyclins. Brief Funct Genomics 2023; 22:411-419. [PMID: 37118891 DOI: 10.1093/bfgp/elad014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/14/2023] [Revised: 03/03/2023] [Accepted: 03/17/2023] [Indexed: 04/30/2023]
Abstract
Cyclins are a group of proteins that activate the cell cycle by forming complexes with cyclin-dependent kinases. Identifying cyclins correctly can provide key clues to understanding their function. However, because cyclin protein sequences share low similarity, a machine learning-based approach to identify cyclins is urgently needed. In this study, cyclin protein sequence features were extracted using the profile-based auto-cross covariance method. The features were then ranked and selected with maximum relevance-maximum distance (MRMD) 1.0 and MRMD2.0. Finally, the prediction model was assessed through 10-fold cross-validation. The computational experiments showed that the best protein sequence features generated by MRMD1.0 could correctly predict 98.2% of cyclins using the random forest (RF) classifier, whereas seven-dimensional key protein sequence features identified with MRMD2.0 could correctly predict 96.1% of cyclins, which was superior to previous studies on the same dataset in terms of both dimensionality and performance. Therefore, our work provides a valuable tool for identifying cyclins. The model data can be downloaded from https://github.com/YUshunL/cyclin.
Affiliation(s)
- Shaoyou Yu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
5
Ali Z, Alturise F, Alkhalifah T, Khan YD. IGPred-HDnet: Prediction of Immunoglobulin Proteins Using Graphical Features and the Hierarchal Deep Learning-Based Approach. Comput Intell Neurosci 2023; 2023:2465414. [PMID: 36744119 PMCID: PMC9891831 DOI: 10.1155/2023/2465414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 09/16/2022] [Accepted: 10/12/2022] [Indexed: 01/26/2023]
Abstract
Motivation. Immunoglobulin proteins (IGPs), also called antibodies, are glycoproteins that act as B-cell receptors against external or internal antigens such as viruses and bacteria. IGPs play a significant role in diverse cellular processes ranging from adhesion to cell recognition. Identifying IGPs in silico is faster and more cost-effective than wet-lab methods. Methods. In this study, we developed a deep learning framework, "IGPred-HDnet", for the discrimination of IGPs and non-IGPs. Three types of descriptors, namely feature extraction based on graphical and statistical features (FEGS), amphiphilic pseudo-amino acid composition (Amp-PseAAC), and dipeptide composition (DPC), are used to extract the graphical, physicochemical, and sequential features. Next, the extracted attributes are evaluated with machine learning classifiers, i.e., decision tree (DT), support vector machine (SVM), k-nearest neighbour (KNN), and hierarchical deep network (HDnet). The proposed predictor IGPred-HDnet was trained and tested using 10-fold cross-validation and an independent test. Results and Conclusion. The accuracy (ACC) and Matthews correlation coefficient (MCC) of IGPred-HDnet on the training and independent datasets (Dtrain and Dtest) are ACC = 98.00% and 99.10%, and MCC = 0.958 and 0.980, respectively. The empirical outcomes demonstrate that IGPred-HDnet, using the novel FEGS feature and the HDnet algorithm, achieved better predictions on both datasets than other existing computational models. We hope this research will provide insights into the large-scale identification of IGPs and assist pharmaceutical companies in new drug design.
Affiliation(s)
- Zakir Ali
- Department of Computer Science, School of Science and Technology, University of Management and Technology, Lahore, Pakistan
- Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Qassim, Saudi Arabia
- Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Qassim, Saudi Arabia
- Yaser Daanial Khan
- Department of Computer Science, School of Science and Technology, University of Management and Technology, Lahore, Pakistan
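Of the three descriptors named in the IGPred-HDnet abstract above, dipeptide composition (DPC) is the simplest to show. The sketch below assumes the common definition (the frequency of each of the 400 ordered amino acid pairs among the overlapping dipeptides of a sequence) and skips pairs containing non-standard residues; the function and constant names are illustrative and are not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional DPC vector for a protein sequence.

    Assumption: pairs containing non-standard residues (e.g. 'X') are
    ignored, and frequencies are normalised by the number of counted pairs.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or without any standard dipeptide.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    vector = dipeptide_composition("MKTAYIAKQR")
    print(len(vector), sum(vector))
```

A fixed-length vector like this is what allows sequences of different lengths to be fed to classifiers such as DT, SVM, or KNN.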
6
Wan H, Liu Q, Ju Y. Utilize a few features to classify presynaptic and postsynaptic neurotoxins. Comput Biol Med 2023; 152:106380. [PMID: 36473343 DOI: 10.1016/j.compbiomed.2022.106380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2022] [Revised: 10/21/2022] [Accepted: 11/28/2022] [Indexed: 12/02/2022]
Abstract
Neurotoxins are a class of proteins that have a significant damaging effect on nerve tissue. They are classified into presynaptic and postsynaptic neurotoxins, and their accurate identification plays a key role in drug development. In this study, 90 presynaptic neurotoxins and 165 postsynaptic neurotoxins were classified. The features of the presynaptic and postsynaptic neurotoxin sequences were extracted using the AutoProp feature extraction method, and feature selection was performed using the maximum relevance maximum distance (MRMD) program. Finally, only two features were retained, achieving 84.7% classification accuracy. Moreover, the two retained features were found in the conserved sites and motifs of presynaptic neurotoxins and could represent the critical structures of presynaptic neurotoxins. This method demonstrates that using a few key features to classify proteins can effectively identify critical protein structures.
Affiliation(s)
- Hao Wan
- Institute of Advanced Cross-field Science, College of Life Science, Qingdao University, Qingdao, China
- Qing Liu
- Department of Anesthesiology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China.
- Ying Ju
- School of Informatics, Xiamen University, Xiamen, China.
7
Yu S, Peng D, Zhu W, Liao B, Wang P, Yang D, Wu F. Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks. Front Pharmacol 2022; 13:1031759. [PMID: 36299898 PMCID: PMC9589247 DOI: 10.3389/fphar.2022.1031759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 09/27/2022] [Indexed: 11/21/2022]
Abstract
DNA-binding proteins (DBPs) play an essential role in the genetics and evolution of organisms. A particular DNA sequence could provide underlying therapeutic benefits for hereditary diseases and cancers. Studying these proteins helps to understand their mechanisms in a timely and effective way and can contribute to disease prevention and treatment. Identifying DNA-binding proteins from sequence databases by existing means is time-consuming, costly, and ineffective. Therefore, efficient methods for improving DBP classification are crucial to disease research. In this paper, we developed a novel predictor, Hybrid_DBP, which identifies potential DBPs by using hybrid features and convolutional neural networks. The method combines two feature methods, MonoDiKGap and Kmer, and then uses MRMD2.0 to remove redundant features. According to the results, 94% of DBPs were correctly recognized, and the accuracy on the independent test set reached 91.2%. This suggests that Hybrid_DBP can become a useful tool for predicting DBPs.
Affiliation(s)
- Shaoyou Yu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- *Correspondence: Wen Zhu
- Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Peng Wang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Dongxuan Yang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
8
Hu RS, Wu J, Zhang L, Zhou X, Zhang Y. CD8TCEI-EukPath: A Novel Predictor to Rapidly Identify CD8+ T-Cell Epitopes of Eukaryotic Pathogens Using a Hybrid Feature Selection Approach. Front Genet 2022; 13:935989. [PMID: 35937988 PMCID: PMC9354802 DOI: 10.3389/fgene.2022.935989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Accepted: 05/24/2022] [Indexed: 12/02/2022]
Abstract
Computational prediction to screen potential vaccine candidates has been proven to be a reliable way to provide guarantees for vaccine discovery in infectious diseases. As an important class of organisms causing infectious diseases, pathogenic eukaryotes (such as parasitic protozoans) have evolved the ability to colonize a wide range of hosts, including humans and animals; meanwhile, protective vaccines are urgently needed. Inspired by the immunological idea that pathogen-derived epitopes are able to mediate the CD8+ T-cell-related host adaptive immune response and with the available positive and negative CD8+ T-cell epitopes (TCEs), we proposed a novel predictor called CD8TCEI-EukPath to detect CD8+ TCEs of eukaryotic pathogens. Our method integrated multiple amino acid sequence-based hybrid features, employed a well-established feature selection technique, and eventually built an efficient machine learning classifier to differentiate CD8+ TCEs from non-CD8+ TCEs. Based on the feature selection results, 520 optimal hybrid features were used for modeling by utilizing the LightGBM algorithm. CD8TCEI-EukPath achieved impressive performance, with an accuracy of 79.255% in ten-fold cross-validation and an accuracy of 78.169% in the independent test. Collectively, CD8TCEI-EukPath will contribute to rapidly screening epitope-based vaccine candidates, particularly from large peptide-coding datasets. To conduct the prediction of CD8+ TCEs conveniently, an online web server is freely accessible (http://lab.malab.cn/~hrs/CD8TCEI-EukPath/).
Affiliation(s)
- Rui-Si Hu
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China
- Jin Wu
- School of Management, Shenzhen Polytechnic, Shenzhen, China
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
- Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, China
- *Correspondence: Xun Zhou; Ying Zhang
- Ying Zhang
- Department of Anesthesiology, Hospital (T.C.M) Affiliated of Southwest Medical University, Luzhou, China
- *Correspondence: Xun Zhou; Ying Zhang
9
Wan H, Zhang J, Ding Y, Wang H, Tian G. Immunoglobulin Classification Based on FC* and GC* Features. Front Genet 2022; 12:827161. [PMID: 35140745 PMCID: PMC8819591 DOI: 10.3389/fgene.2021.827161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 12/22/2021] [Indexed: 11/13/2022]
Abstract
Immunoglobulins have a pivotal role in disease regulation. Therefore, it is vital to accurately identify immunoglobulins to develop new drugs and research related diseases. Compared with utilizing high-dimension features to identify immunoglobulins, this research aimed to examine a method to classify immunoglobulins and non-immunoglobulins using two features, FC* and GC*. Classification of 228 samples (109 immunoglobulin samples and 119 non-immunoglobulin samples) revealed that the overall accuracy was 80.7% in 10-fold cross-validation using the J48 classifier implemented in Weka software. The FC* feature identified in this study was found in the immunoglobulin subtype domain, which demonstrated that this extracted feature could represent functional and structural properties of immunoglobulins for forecasting.
Affiliation(s)
- Hao Wan
- Institute of Advanced Cross-field Science, College of Life Science, Qingdao University, Qingdao, China
- Jina Zhang
- Geneis (Beijing) Co., Ltd., Beijing, China
- Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Hetian Wang
- Beidahuang Industry Group General Hospital, Harbin, China
- *Correspondence: Hetian Wang; Geng Tian
- Geng Tian
- Geneis (Beijing) Co., Ltd., Beijing, China
- *Correspondence: Hetian Wang; Geng Tian
10
Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med 2021; 19:449. [PMID: 34706730 PMCID: PMC8554859 DOI: 10.1186/s12967-021-03084-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 06/23/2021] [Accepted: 09/16/2021] [Indexed: 12/21/2022]
Abstract
BACKGROUND Cancer is one of the most serious diseases threatening human health. Cancer immunotherapy represents the most promising treatment strategy due to its high efficacy and selectivity and lower side effects compared with traditional treatment. The identification of tumor T cell antigens is one of the most important tasks for antitumor vaccine development and molecular function investigation. Although several machine learning predictors have been developed to identify tumor T cell antigens, accurate identification with existing methods remains challenging. METHODS In this study, we used a non-redundant dataset of 592 tumor T cell antigens (positive samples) and 393 non-tumor T cell antigens (negative samples). Four types of feature encoding methods were studied to build an efficient predictor, including amino acid composition, global protein sequence descriptors and grouped amino acid and peptide composition. To improve the feature representation ability of the hybrid features, we further employed a two-step feature selection technique to search for the optimal feature subset. The final prediction model was constructed using the random forest algorithm. RESULTS The top 263 informative features were selected to train the random forest classifier for detecting tumor T cell antigen peptides. iTTCA-RF provides satisfactory performance, with balanced accuracy, specificity and sensitivity values of 83.71%, 78.73% and 88.69% over tenfold cross-validation as well as 73.14%, 62.67% and 83.61% over independent tests, respectively. The online prediction server is freely accessible at http://lab.malab.cn/~acy/iTTCA. CONCLUSIONS We have shown that the proposed predictor iTTCA-RF is superior to the other latest models, and it will hopefully become an effective and useful tool for identifying tumor T cell antigens presented in the context of major histocompatibility complex class I.
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Huannan Guo
- Department of Oncology, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China.
- Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
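The figures quoted for iTTCA-RF above are consistent with balanced accuracy being the mean of sensitivity and specificity: (88.69% + 78.73%) / 2 = 83.71% for cross-validation and (83.61% + 62.67%) / 2 = 73.14% for the independent test. The sketch below computes these metrics, plus accuracy and the Matthews correlation coefficient used elsewhere in this list, from a binary confusion matrix. The function name and the example counts are illustrative only; the underlying confusion matrices are not given in the abstract.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from raw counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in confusion_metrics(tp=8, fp=4, tn=6, fn=2).items():
        print(f"{name}: {value:.4f}")
```

Balanced accuracy is the more informative headline number here because the dataset is imbalanced (592 positive versus 393 negative samples), so plain accuracy would reward a bias toward the positive class.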
11
CWLy-RF: A novel approach for identifying cell wall lyases based on random forest classifier. Genomics 2021; 113:2919-2924. [PMID: 34186189 DOI: 10.1016/j.ygeno.2021.06.038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/15/2021] [Revised: 06/20/2021] [Accepted: 06/25/2021] [Indexed: 02/05/2023]
Abstract
Drug resistance of pathogenic bacteria has become increasingly serious due to the abuse of antibiotics in recent years. Researchers have found that cell wall lyases are effective antibacterial agents that can specifically recognize target bacteria and degrade bacterial peptidoglycan. Traditional wet experiments are usually expensive, time-consuming and laborious for the identification of lyases. Therefore, there is an urgent need to develop prediction tools based on computer methods to identify lyases quickly and accurately. In this paper, a new predictor, CWLy-RF, is proposed based on the random forest (RF) algorithm to identify cell wall lyases. In this method, we combined three features, namely, 400D, 188D and the composition of k-spaced amino acid group pairs, using mixed-feature representation methods. Afterward, we improved the feature representation ability with the selected top 100 features by using the information gain method and trained a predictive model using RF. The constructed prediction model is evaluated by using 10-fold cross-validation. The accuracy obtained was 96.09%, the AUC was 0.993, the MCC was 0.922, the sensitivity was 94.92%, and the specificity was 97.32%. We have proved that the proposed predictor CWLy-RF is superior to other latest models, and it will hopefully become an effective and useful tool for identifying lyases.
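The CWLy-RF abstract above ranks features by information gain and keeps the top 100. A minimal sketch of that ranking step follows. It assumes the feature values have already been discretised (the abstract does not say how continuous features were binned), and the function names are hypothetical rather than the authors' code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, top_k):
    """Return the column indices of the top_k features by information gain."""
    n_features = len(rows[0])
    gains = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return order[:top_k]


if __name__ == "__main__":
    rows = [[0, 1], [0, 0], [1, 1], [1, 0]]
    labels = [0, 0, 1, 1]
    print(rank_features(rows, labels, top_k=1))
```

In this toy input the first column determines the label completely and the second carries no information, so the first column is ranked highest.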