1
|
Flamholz ZN, Biller SJ, Kelly L. Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 2024; 9:537-549. [PMID: 38287147 DOI: 10.1038/s41564-023-01584-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2023] [Accepted: 12/08/2023] [Indexed: 01/31/2024]
Abstract
Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.
Collapse
Affiliation(s)
- Zachary N Flamholz
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Steven J Biller
- Department of Biological Sciences, Wellesley College, Wellesley, MA, USA
| | - Libusha Kelly
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.
- Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, NY, USA.
| |
Collapse
|
2
|
Zou H, Yu W. Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins. J Comput Biol 2023; 30:1131-1143. [PMID: 37729064 DOI: 10.1089/cmb.2022.0237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2023] Open
Abstract
Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. Therefore, we proposed a novel machine-learning model to identify PVPs in the current study. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, including Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we used PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, these chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. It indicates that the proposed computational model may become a powerful predictor in identifying PVPs.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Wanting Yu
- College of Animal Science and Technology, Jiangxi Agricultural University, Nanchang, China
| |
Collapse
|
3
|
Shang J, Peng C, Tang X, Sun Y. PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer. Bioinformatics 2023; 39:i30-i39. [PMID: 37387136 DOI: 10.1093/bioinformatics/btad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes with low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, the structural proteins, such as major tail, baseplate, etc. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
Collapse
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Cheng Peng
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| |
Collapse
|
4
|
Prediction of Phage Virion Proteins Using Machine Learning Methods. Molecules 2023; 28:molecules28052238. [PMID: 36903484 PMCID: PMC10004995 DOI: 10.3390/molecules28052238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2023] [Revised: 01/27/2023] [Accepted: 02/20/2023] [Indexed: 03/04/2023] Open
Abstract
Antimicrobial resistance (AMR) is a major problem and an immediate alternative to antibiotics is the need of the hour. Research on the possible alternative products to tackle bacterial infections is ongoing worldwide. One of the most promising alternatives to antibiotics is the use of bacteriophages (phage) or phage-driven antibacterial drugs to cure bacterial infections caused by AMR bacteria. Phage-driven proteins, including holins, endolysins, and exopolysaccharides, have shown great potential in the development of antibacterial drugs. Likewise, phage virion proteins (PVPs) might also play an important role in the development of antibacterial drugs. Here, we have developed a machine learning-based prediction method to predict PVPs using phage protein sequences. We have employed well-known basic and ensemble machine learning methods with protein sequence composition features for the prediction of PVPs. We found that the gradient boosting classifier (GBC) method achieved the best accuracy of 80% on the training dataset and an accuracy of 83% on the independent dataset. The performance on the independent dataset is better than other existing methods. A user-friendly web server developed by us is freely available to all users for the prediction of PVPs from phage protein sequences. The web server might facilitate the large-scale prediction of PVPs and hypothesis-driven experimental study design.
Collapse
|
5
|
Understanding Bacteriophage Tail Fiber Interaction with Host Surface Receptor: The Key “Blueprint” for Reprogramming Phage Host Range. Int J Mol Sci 2022; 23:ijms232012146. [PMID: 36292999 PMCID: PMC9603124 DOI: 10.3390/ijms232012146] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2022] [Revised: 10/06/2022] [Accepted: 10/10/2022] [Indexed: 11/16/2022] Open
Abstract
Bacteriophages (phages), as natural antibacterial agents, are being rediscovered because of the growing threat of multi- and pan-drug-resistant bacterial pathogens globally. However, with an estimated 1031 phages on the planet, finding the right phage to recognize a specific bacterial host is like looking for a needle in a trillion haystacks. The host range of a phage is primarily determined by phage tail fibers (or spikes), which initially mediate reversible and specific recognition and adsorption by susceptible bacteria. Recent significant advances at single-molecule and atomic levels have begun to unravel the structural organization of tail fibers and underlying mechanisms of phage–host interactions. Here, we discuss the molecular mechanisms and models of the tail fibers of the well-characterized T4 phage’s interaction with host surface receptors. Structure–function knowledge of tail fibers will pave the way for reprogramming phage host range and will bring future benefits through more-effective phage therapy in medicine. Furthermore, the design strategies of tail fiber engineering are briefly summarized, including machine-learning-assisted engineering inspired by the increasingly enormous amount of phage genetic information.
Collapse
|
6
|
Fang Z, Feng T, Zhou H, Chen M. DeePVP: Identification and classification of phage virion proteins using deep learning. Gigascience 2022; 11:6661052. [PMID: 35950840 PMCID: PMC9366990 DOI: 10.1093/gigascience/giac076] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 06/08/2022] [Accepted: 07/11/2022] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Many biological properties of phages are determined by phage virion proteins (PVPs), and the poor annotation of PVPs is a bottleneck for many areas of viral research, such as viral phylogenetic analysis, viral host identification, and antibacterial drug design. Because of the high diversity of PVP sequences, the PVP annotation of a phage genome remains a particularly challenging bioinformatic task. FINDINGS Based on deep learning, we developed DeePVP. The main module of DeePVP aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module of DeePVP can further classify predicted PVPs into the 10 major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs. Two application cases show that the predictions of DeePVP are more reliable and can better reveal the compact PVP-enriched region than the current state-of-the-art tools. Particularly, in the Escherichia phage phiEC1 genome, a novel PVP-enriched region that is conserved in many other Escherichia phage genomes was identified, indicating that DeePVP will be a useful tool for the analysis of phage genomic structures. CONCLUSIONS DeePVP outperforms state-of-the-art tools. The program is optimized in both a virtual machine with graphical user interface and a docker so that the tool can be easily run by noncomputer professionals. DeePVP is freely available at https://github.com/fangzcbio/DeePVP/.
Collapse
Affiliation(s)
- Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Muxuan Chen
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| |
Collapse
|
7
|
Ataee S, Brochet X, Peña-Reyes CA. Bacteriophage Genetic Edition Using LSTM. FRONTIERS IN BIOINFORMATICS 2022; 2:932319. [PMID: 36353213 PMCID: PMC9639385 DOI: 10.3389/fbinf.2022.932319] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2022] [Accepted: 06/06/2022] [Indexed: 09/16/2023] Open
Abstract
Bacteriophages are gaining increasing interest as antimicrobial tools, largely due to the emergence of multi-antibiotic-resistant bacteria. Although their huge diversity and virulence make them particularly attractive for targeting a wide range of bacterial pathogens, it is difficult to select suitable phages due to their high specificity which limits their host range. In addition, other challenges remain such as structural fragility under certain environmental conditions, immunogenicity of phage therapy, or development of bacterial resistance. The use of genetically engineered phages may reduce characteristics that hinder prophylactic and therapeutic applications of phages. Nowadays, there is no systematic method to modify a given phage genome conferring its sought characteristics. We explore the use of artificial intelligence for this purpose as it has the potential to both guide and accelerate genome modification to generate phage variants with unique properties that overcome the limitations of natural phages. We propose an original architecture composed of two deep learning-driven components: a phage-bacterium interaction predictor and a phage genome-sequence generator. The former is a multi-branch 1-D convolutional neural network (1D-CNN) that analyses phage and bacterial genomes to predict interactions. The latter is a recurrent neural network, more particularly a long short-term memory (LSTM), that performs genomic modifications to a phage to offer substantial host range improvement. For this component, we developed two different architectures composed of one or two stacked LSTM layers with 256 neurons each. These generators are used to modify, more precisely to rewrite, the genome sequence of 42 selected phages, while the predictor is used to estimate the host range of the modified bacteriophages across 46 strains of Pseudomonas aeruginosa. The proposed generators, trained with an average accuracy of 96.1%, are able to improve the host range for an average of 18 phages among the 42 under study, increasing both their average host range, by 73.0 and 103.7%, and the maximum host ranges from 21 to 24 and 29, respectively. These promising results showed that the use of deep learning methodologies allows genetic modification of phages to extend, for instance, their host range, confirming the potential of these approaches to guide bacteriophage engineering.
Collapse
Affiliation(s)
- Shabnam Ataee
- Institute of Information and Communication Technology (IICT), School of Management and Engineering Vaud (HEIG-VD), Yverdon-les-Bains, Switzerland
- HES-SO University of Applied Sciences and Arts Western Switzerland, Delémont, Switzerland
- CI4CB—Computational Intelligence for Computational Biology, SIB—Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Xavier Brochet
- Institute of Information and Communication Technology (IICT), School of Management and Engineering Vaud (HEIG-VD), Yverdon-les-Bains, Switzerland
- HES-SO University of Applied Sciences and Arts Western Switzerland, Delémont, Switzerland
- CI4CB—Computational Intelligence for Computational Biology, SIB—Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Carlos Andrés Peña-Reyes
- Institute of Information and Communication Technology (IICT), School of Management and Engineering Vaud (HEIG-VD), Yverdon-les-Bains, Switzerland
- HES-SO University of Applied Sciences and Arts Western Switzerland, Delémont, Switzerland
- CI4CB—Computational Intelligence for Computational Biology, SIB—Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
Collapse
|
8
|
Wang Y, Miao X, Xiao G, Huang C, Sun J, Wang Y, Li P, You X. Clinical Prediction of Heart Failure in Hemodialysis Patients: Based on the Extreme Gradient Boosting Method. Front Genet 2022; 13:889378. [PMID: 35559036 PMCID: PMC9086166 DOI: 10.3389/fgene.2022.889378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Accepted: 03/15/2022] [Indexed: 11/18/2022] Open
Abstract
Background: Heart failure (HF) is the main cause of mortality in hemodialysis (HD) patients. However, it is still a challenge for the prediction of HF in HD patients. Therefore, we aimed to establish and validate a prediction model to predict HF events in HD patients. Methods: A total of 355 maintenance HD patients from two hospitals were included in this retrospective study. A total of 21 variables, including traditional demographic characteristics, medical history, and blood biochemical indicators, were used. Two classification models were established based on the extreme gradient boosting (XGBoost) algorithm and traditional linear logistic regression. The performance of the two models was evaluated based on calibration curves and area under the receiver operating characteristic curves (AUCs). Feature importance and SHapley Additive exPlanation (SHAP) were used to recognize risk factors from the variables. The Kaplan–Meier curve of each risk factor was constructed and compared with the log-rank test. Results: Compared with the traditional linear logistic regression, the XGBoost model had better performance in accuracy (78.5 vs. 74.8%), sensitivity (79.6 vs. 75.6%), specificity (78.1 vs. 74.4%), and AUC (0.814 vs. 0.722). The feature importance and SHAP value of XGBoost indicated that age, hypertension, platelet count (PLT), C-reactive protein (CRP), and white blood cell count (WBC) were risk factors of HF. These results were further confirmed by Kaplan–Meier curves. Conclusions: The HF prediction model based on XGBoost had a satisfactory performance in predicting HF events, which could prove to be a useful tool for the early prediction of HF in HD.
Collapse
Affiliation(s)
- Yanfeng Wang
- The School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China
| | - Xisha Miao
- The School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China
| | - Gang Xiao
- Department of Clinical Laboratory, The Third Affiliated Hospital, Southern Medical University, Guangzhou, China
| | - Chun Huang
- The School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China
| | - Junwei Sun
- The School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China
| | - Ying Wang
- Department of Medical Statistics, School of Public Health, Sun Yat-sen University, Guangzhou, China
| | - Panlong Li
- The School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China
| | - Xu You
- Department of Clinical Laboratory, The Third Affiliated Hospital, Southern Medical University, Guangzhou, China
| |
Collapse
|
9
|
Phage_UniR_LGBM: Phage Virion Proteins Classification with UniRep Features and LightGBM Model. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:9470683. [PMID: 35465015 PMCID: PMC9033350 DOI: 10.1155/2022/9470683] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/27/2022] [Accepted: 03/15/2022] [Indexed: 11/23/2022]
Abstract
Phage, the most prevalent creature on the planet, serves a variety of critical roles. Phage's primary role is to facilitate gene-to-gene communication. The phage proteins can be defined as the virion proteins and the nonvirion ones. Nowadays, experimental identification is a difficult process that necessitates a significant amount of laboratory time and expense. Considering such situation, it is critical to design practical calculating techniques and develop well-performance tools. In this work, the Phage_UniR_LGBM has been proposed to classify the virion proteins. In detailed, such model utilizes the UniRep as the feature and the LightGBM algorithm as the classification model. And then, the training data train the model, and the testing data test the model with the cross-validation. The Phage_UniR_LGBM was compared with the several state-of-the-art features and classification algorithms. The performances of the Phage_UniR_LGBM are 88.51% in Sp,89.89% in Sn, 89.18% in Acc, 0.7873 in MCC, and 0.8925 in F1 score.
Collapse
|
10
|
Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI JOURNAL 2022; 21:11-29. [PMID: 35145365 PMCID: PMC8822302 DOI: 10.17179/excli2021-4411] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 11/29/2021] [Indexed: 12/11/2022]
Abstract
Phage virion proteins (PVPs) are effective at recognizing and binding to host cell receptors while having no deleterious effects on human or animal cells. Understanding their functional mechanisms is regarded as a critical goal that will aid in rational antibacterial drug discovery and development. Although high-throughput experimental methods for identifying PVPs are considered the gold standard for exploring crucial PVP features, these procedures are frequently time-consuming and labor-intensive. Thusfar, more than ten sequence-based predictors have been established for the in silico identification of PVPs in conjunction with traditional experimental approaches. As a result, a revised and more thorough assessment is extremely desirable. With this purpose in mind, we first conduct a thorough survey and evaluation of a vast array of 13 state-of-the-art PVP predictors. Among these PVP predictors, they can be classified into three groups according to the types of machine learning (ML) algorithms employed (i.e. traditional ML-based methods, ensemble-based methods and deep learning-based methods). Subsequently, we explored which factors are important for building more accurate and stable predictors and this included training/independent datasets, feature encoding algorithms, feature selection methods, core algorithms, performance evaluation metrics/strategies and web servers. Finally, we provide insights and future perspectives for the design and development of new and more effective computational approaches for the detection and characterization of PVPs.
Collapse
Affiliation(s)
- Muhammad Kabir
- School of Systems and Technology, Department of Computer Science, University of Management and Technology, Lahore, Pakistan, 54770
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand, 40002
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| |
Collapse
|
11
|
Gu X, Guo L, Liao B, Jiang Q. Pseudo-188D: Phage Protein Prediction Based on a Model of Pseudo-188D. Front Genet 2021; 12:796327. [PMID: 34925468 PMCID: PMC8672092 DOI: 10.3389/fgene.2021.796327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022] Open
Abstract
Phages have seriously affected the biochemical systems of the world, and not only are phages related to our health, but medical treatments for many cancers and skin infections are related to phages; therefore, this paper sought to identify phage proteins. In this paper, a Pseudo-188D model was established. The digital features of the phage were extracted by PseudoKNC, an appropriate vector was selected by the AdaBoost tool, and features were extracted by 188D. Then, the extracted digital features were combined together, and finally, the viral proteins of the phage were predicted by a stochastic gradient descent algorithm. Our model effect reached 93.4853%. To verify the stability of our model, we randomly selected 80% of the downloaded data to train the model and used the remaining 20% of the data to verify the robustness of our model.
Collapse
Affiliation(s)
- Xiaomei Gu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Institute of Yangtze River Delta, University of Electronic Science and Technology of China, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Lina Guo
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Qinghua Jiang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
12
|
Wu Y, Guo Y, Ma J, Sa Y, Li Q, Zhang N. Research Progress of Gliomas in Machine Learning. Cells 2021; 10:cells10113169. [PMID: 34831392 PMCID: PMC8622230 DOI: 10.3390/cells10113169] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 12/29/2022] Open
Abstract
In the field of gliomas research, the broad availability of genetic and image information originated by computer technologies and the booming of biomedical publications has led to the advent of the big-data era. Machine learning methods were applied as possible approaches to speed up the data mining processes. In this article, we reviewed the present situation and future orientations of machine learning application in gliomas within the context of workflows to integrate analysis for precision cancer care. Publicly available tools or algorithms for key machine learning technologies in the literature mining for glioma clinical research were reviewed and compared. Further, the existing solutions of machine learning methods and their limitations in glioma prediction and diagnostics, such as overfitting and class imbalanced, were critically analyzed.
Collapse
|
13
|
Zhang S, Shi H. iR5hmcSC: Identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput Biol Chem 2021; 95:107583. [PMID: 34562726 DOI: 10.1016/j.compbiolchem.2021.107583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2021] [Revised: 09/02/2021] [Accepted: 09/12/2021] [Indexed: 01/27/2023]
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and the biological evolution. The study of its distribution in transcriptome is fundamentally crucial to reveal the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Therefore, it is urgent to develop more effective and feasible computational methods. In this paper, a novel and powerful model called iR5hmcSC is designed for identifying 5hmC. Firstly, we extract the different features by K-mer, Pseudo Structure Status Composition and One-Hot encoding. Subsequently, the combination of chi-square test and logistic regression is utilized as the feature selection method to select the optimal feature sets. And then stacking learning, an ensemble learning method including random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to recognize 5hmC and non-5hmC. Finally, 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% and 79.92% on benchmark dataset and independent dataset, respectively. The result is better than the state-of-the-art methods, which indicates that our model is a feasible tool to identify 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China.
| | - Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China
| |
Collapse
|
14
|
iPVP-MCV: A Multi-Classifier Voting Model for the Accurate Identification of Phage Virion Proteins. Symmetry (Basel) 2021. [DOI: 10.3390/sym13081506] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential material of the infectious viral particles and is responsible for multiple biological functions. Accurate identification of PVPs is of great significance for comprehending the interaction between phages and host bacteria and developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, the development of computational methods that can efficiently and accurately identify PVPs is desired. In this study, we proposed a multi-classifier voting model called iPVP-MCV to enhance the predictive performance of PVPs based on their amino acid sequences. First, three types of evolutionary features were extracted from the position-specific scoring matrix (PSSM) profiles to represent PVPs and non-PVPs. Then, a set of baseline models were trained based on the support vector machine (SVM) algorithm combined with each type of feature descriptors. Finally, the outputs of these baseline models were integrated to construct the proposed method iPVP-MCV by using the majority voting strategy. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods when performing the rigorous independent dataset test.
Collapse
|
15
|
Zong Y, Li X. Identification of Causal Genes of COVID-19 Using the SMR Method. Front Genet 2021; 12:690349. [PMID: 34290742 PMCID: PMC8287881 DOI: 10.3389/fgene.2021.690349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2021] [Accepted: 05/07/2021] [Indexed: 01/03/2023] Open
Abstract
Since the first report of COVID-19 in December 2019, more than 100 million people have been infected with SARS-CoV-2. Despite ongoing research, there is still limited knowledge about the genetic causes of COVID-19. To resolve this problem, we applied the SMR method to analyze the genes involved in COVID-19 pathogenesis by the integration of multiple omics data. Here, we assessed the SNPs associated with COVID-19 risk from the GWAS data of Spanish and Italian patients and lung eQTL data from the GTEx project. Then, GWAS and eQTL data were integrated by summary-data-based (SMR) methods using SNPs as instrumental variables (IVs). As a result, six protein-coding and five non-protein-coding genes regulated by nine SNPs were identified as significant risk factors for COVID-19. Functional analysis of these genes showed that UQCRH participates in cardiac muscle contraction, PPA2 is closely related to sudden cardiac failure (SCD), and OGT, as the interacting gene partner of PANO1, is associated with neurological disease. Observational studies show that myocardial damage, SCD, and neurological disease often occur in COVID-19 patients. Thus, our findings provide a potential molecular mechanism for understanding the complications of COVID-19.
Collapse
Affiliation(s)
- Yan Zong
- Department of Infectious Diseases, Yiwu Central Hospital, Jinhua, China
| | - Xiaofei Li
- Department of Infectious Diseases, Yiwu Central Hospital, Jinhua, China
| |
Collapse
|
16
|
Nami Y, Imeni N, Panahi B. Application of machine learning in bacteriophage research. BMC Microbiol 2021; 21:193. [PMID: 34174831 PMCID: PMC8235560 DOI: 10.1186/s12866-021-02256-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2021] [Accepted: 06/08/2021] [Indexed: 12/20/2022] Open
Abstract
Phages are one of the key components in the structure, dynamics, and interactions of microbial communities in different bins. It has a clear impact on human health and the food industry. Bacteriophage characterization using in vitro approaches are time/cost consuming and laborious tasks. On the other hand, with the advent of new high-throughput sequencing technology, the development of a powerful computational framework to characterize the newly identified bacteriophages is inevitable for future research. Machine learning includes powerful techniques that enable the analysis of complex datasets for knowledge discovery and pattern recognition. In this study, we have conducted a comprehensive review of machine learning methods application using different types of features were applied in various aspects of bacteriophage research including, automated curation, identification, classification, host species recognition, virion protein identification, and life cycle prediction. Moreover, potential limitations and advantages of the developed frameworks were discussed.
Collapse
Affiliation(s)
- Yousef Nami
- Department of Food Biotechnology, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
| | - Nazila Imeni
- Young Researchers and Elite Clube, Marand Branch, Islamic Azad University, Marand, Iran
| | - Bahman Panahi
- Department of Genomics, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran.
| |
Collapse
|
17
|
Component Parts of Bacteriophage Virions Accurately Defined by a Machine-Learning Approach Built on Evolutionary Features. mSystems 2021; 6:e0024221. [PMID: 34042467 PMCID: PMC8269216 DOI: 10.1128/msystems.00242-21] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Antimicrobial resistance (AMR) continues to evolve as a major threat to human health, and new strategies are required for the treatment of AMR infections. Bacteriophages (phages) that kill bacterial pathogens are being identified for use in phage therapies, with the intention to apply these bactericidal viruses directly into the infection sites in bespoke phage cocktails. Despite the great unsampled phage diversity for this purpose, an issue hampering the roll out of phage therapy is the poor quality annotation of many of the phage genomes, particularly for those from infrequently sampled environmental sources. We developed a computational tool called STEP3 to use the “evolutionary features” that can be recognized in genome sequences of diverse phages. These features, when integrated into an ensemble framework, achieved a stable and robust prediction performance when benchmarked against other prediction tools using phages from diverse sources. Validation of the prediction accuracy of STEP3 was conducted with high-resolution mass spectrometry analysis of two novel phages, isolated from a watercourse in the Southern Hemisphere. STEP3 provides a robust computational approach to distinguish specific and universal features in phages to improve the quality of phage cocktails and is available for use at http://step3.erc.monash.edu/. IMPORTANCE In response to the global problem of antimicrobial resistance, there are moves to use bacteriophages (phages) as therapeutic agents. Selecting which phages will be effective therapeutics relies on interpreting features contributing to shelf-life and applicability to diagnosed infections. However, the protein components of the phage virions that dictate these properties vary so much in sequence that best estimates suggest failure to recognize up to 90% of them. We have utilized this diversity in evolutionary features as an advantage, to apply machine learning for prediction accuracy for diverse components in phage virions. We benchmark this new tool showing the accurate recognition and evaluation of phage component parts using genome sequence data of phages from undersampled environments, where the richest diversity of phage still lies.
Collapse
|
18
|
Abstract
Bacteriophages are viruses whose ubiquity in nature and remarkable specificity to their host bacteria enable an impressive and growing field of tunable biotechnologies in agriculture and public health. Bacteriophage capsids, which house and protect their nucleic acids, have been modified with a range of functionalities (e.g., fluorophores, nanoparticles, antigens, drugs) to suit their final application. Functional groups naturally present on bacteriophage capsids can be used for electrostatic adsorption or bioconjugation, but their impermanence and poor specificity can lead to inconsistencies in coverage and function. To overcome these limitations, researchers have explored both genetic and chemical modifications to enable strong, specific bonds between phage capsids and their target conjugates. Genetic modification methods involve introducing genes for alternative amino acids, peptides, or protein sequences into either the bacteriophage genomes or capsid genes on host plasmids to facilitate recombinant phage generation. Chemical modification methods rely on reacting functional groups present on the capsid with activated conjugates under the appropriate solution pH and salt conditions. This review surveys the current state-of-the-art in both genetic and chemical bacteriophage capsid modification methodologies, identifies major strengths and weaknesses of methods, and discusses areas of research needed to propel bacteriophage technology in development of biosensors, vaccines, therapeutics, and nanocarriers.
Collapse
Affiliation(s)
| | - Julie M. Goddard
- Department of Food Science, Cornell University, Ithaca, NY 14853, USA
| | - Sam R. Nugen
- Department of Food Science, Cornell University, Ithaca, NY 14853, USA
| |
Collapse
|
19
|
Meng C, Wu J, Guo F, Dong B, Xu L. CWLy-pred: A novel cell wall lytic enzyme identifier based on an improved MRMD feature selection method. Genomics 2020; 112:4715-4721. [DOI: 10.1016/j.ygeno.2020.08.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2020] [Revised: 08/04/2020] [Accepted: 08/13/2020] [Indexed: 10/25/2022]
|
20
|
Meta-iPVP: a sequence-based meta-predictor for improving the prediction of phage virion proteins using effective feature representation. J Comput Aided Mol Des 2020; 34:1105-1116. [DOI: 10.1007/s10822-020-00323-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 06/10/2020] [Indexed: 12/11/2022]
|