1
Peng S. HIV-1 M group subtype classification using deep learning approach. Comput Biol Med 2024; 183:109218. [PMID: 39369547 DOI: 10.1016/j.compbiomed.2024.109218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 09/24/2024] [Accepted: 09/25/2024] [Indexed: 10/08/2024]
Abstract
Traditionally, the classification of HIV-1 M group subtypes has depended on statistical methods constrained by sample sizes. Here, HIV-1-M-SPBEnv is proposed as the first deep learning-based method for classifying HIV-1 M group subtypes from env gene sequences. This approach overcomes sample size challenges by using artificial molecular evolution techniques to generate a synthetic dataset suitable for machine learning. Employing a convolutional autoencoder with two residual blocks and two transpose residual blocks, followed by a fully connected neural network block, HIV-1-M-SPBEnv reduces complex, high-dimensional DNA sequence data to concise, information-rich, low-dimensional representations, achieving high classification accuracy. On an independent validation dataset, the precision, accuracy, recall, and F1 score of the HIV-1-M-SPBEnv model predictions were all 100%, confirming its capability to accurately identify all 12 subtypes of the HIV-1 M group. Deployed through a web server, it provides HIV-1 M group subtype prediction for researchers and clinicians. The HIV-1-M-SPBEnv web server is accessible at http://www.hivsubclass.com and all the code is available at https://github.com/pengsihua2023/HIV-1-M-SPBEnv.
Affiliation(s)
- Sihua Peng
- Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, 30602, United States.
2
Vello F, Filippini F, Righetto I. Bioinformatics Goes Viral: I. Databases, Phylogenetics and Phylodynamics Tools for Boosting Virus Research. Viruses 2024; 16:1425. [PMID: 39339901 PMCID: PMC11437414 DOI: 10.3390/v16091425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 08/21/2024] [Accepted: 09/03/2024] [Indexed: 09/30/2024] Open
Abstract
Computer-aided analysis of proteins or nucleic acids seems like a matter of course nowadays; however, the history of Bioinformatics and Computational Biology is quite recent. The advent of high-throughput sequencing has led to the production of "big data", which has also affected the field of virology. Collaboration between the communities of bioinformaticians and virologists started a few decades ago and was strongly enhanced by the recent SARS-CoV-2 pandemic. In this article, which is the first in a series on how bioinformatics can enhance virus research, we show that highly useful information is retrievable from selected general and dedicated databases. Indeed, an enormous amount of information, both in terms of nucleotide/protein sequences and their annotation, is deposited in the general databases of international organisations participating in the International Nucleotide Sequence Database Collaboration (INSDC). However, more and more virus-specific databases have been established and are progressively enriched with the contents and features reported in this article. Since viruses are obligate intracellular parasites, a special focus is given to host-pathogen protein-protein interaction databases. Finally, we illustrate several phylogenetic and phylodynamic tools, combining information on algorithms and features with practical information on how to use them and case studies that validate their usefulness. Databases and tools for functional inference will be covered in the next article of this series: Bioinformatics goes viral: II. Sequence-based and structure-based functional analyses for boosting virus research.
Affiliation(s)
- Francesco Filippini
- Synthetic Biology and Biotechnology Unit, Department of Biology, University of Padua, 35131 Padua, Italy; (F.V.); (I.R.)
3
Wade KE, Chen L, Deng C, Zhou G, Hu P. Investigating alignment-free machine learning methods for HIV-1 subtype classification. Bioinformatics Advances 2024; 4:vbae108. [PMID: 39228995 PMCID: PMC11371153 DOI: 10.1093/bioadv/vbae108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 07/26/2024] [Indexed: 09/05/2024]
Abstract
Motivation: Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification. Results: We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating that it has good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising results on precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. Our results could help to develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes and the development of subtype-specific treatments. Availability and implementation: Source code is available at https://www.github.com/kwade4/HIV_Subtypes.
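To make the idea of an alignment-free numerical representation concrete, the following is a minimal sketch of k-mer frequency vectorization. It is an illustration only and not the authors' pipeline: the function name `kmer_vector`, the default `k = 3`, the normalization to frequencies, and the choice to ignore k-mers containing ambiguous bases are all assumptions made here.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq: str, k: int = 3) -> list[float]:
    """Return k-mer frequencies of seq in a fixed lexicographic order.

    Hypothetical helper for illustration; k-mers containing symbols
    other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    # Fixed vocabulary so every sequence maps to a vector of length 4**k.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]
```

A vector built this way has the same length for every sequence, so it can be passed to a standard classifier without aligning the sequences first.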
Affiliation(s)
- Kaitlyn E Wade
- Department of Computer Science, University of Western Ontario, London, ON N6A 3K7, Canada
- Lianghong Chen
- Department of Computer Science, University of Western Ontario, London, ON N6A 3K7, Canada
- Chutong Deng
- Department of Computer Science, University of Western Ontario, London, ON N6A 3K7, Canada
- Gen Zhou
- Department of Computer Science, University of Western Ontario, London, ON N6A 3K7, Canada
- Pingzhao Hu
- Department of Computer Science, University of Western Ontario, London, ON N6A 3K7, Canada
- Department of Biochemistry, University of Western Ontario, London, ON N6A 3K7, Canada
4
4D-Dynamic Representation of DNA/RNA Sequences: Studies on Genetic Diversity of Echinococcus multilocularis in Red Foxes in Poland. Life (Basel, Switzerland) 2022; 12:life12060877. [PMID: 35743908 PMCID: PMC9227292 DOI: 10.3390/life12060877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 05/20/2022] [Accepted: 06/08/2022] [Indexed: 11/17/2022]
Abstract
The 4D-Dynamic Representation of DNA/RNA Sequences, an alignment-free bioinformatics method recently developed by us, has been used to study the genetic diversity of Echinococcus multilocularis in red foxes in Poland. Sequences of three mitochondrial genes, i.e., NADH dehydrogenase subunit 2 (nad2), cytochrome b (cob), and cytochrome c oxidase subunit 1 (cox1), are analyzed. The sequences are represented by sets of material points in a 4D space, i.e., 4D-dynamic graphs. As a visualization of the sequences, projections of the graphs into 3D space are shown. The differences between 3D graphs corresponding to European, Asian, and American haplotypes are small. Numerical characteristics (sequence descriptors) applied in the studies can recognize the differences. The concept of creating descriptors of 4D-dynamic graphs has been borrowed from classical dynamics; these are coordinates of the centers of mass and moments of inertia of 4D-dynamic graphs. Based on these descriptors, classification maps are constructed. The concentrations of points in the maps indicate one Polish haplotype (EmPL9) of Asian origin.
5
Bielińska-Wąż D, Wąż P. Non-standard bioinformatics characterization of SARS-CoV-2. Comput Biol Med 2021; 131:104247. [PMID: 33611129 PMCID: PMC7966820 DOI: 10.1016/j.compbiomed.2021.104247] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/20/2020] [Revised: 01/22/2021] [Accepted: 01/26/2021] [Indexed: 12/16/2022]
Abstract
A non-standard bioinformatics method, 4D-Dynamic Representation of DNA/RNA Sequences, aimed at analyzing the information available in nucleotide databases, has been formulated. The sequences are represented by sets of "material points" in a 4D space, forming 4D-dynamic graphs. The graphs representing the sequences are treated as "rigid bodies" and characterized by values analogous to those used in classical dynamics. As graphical representations of the sequences, the projections of the graphs into 2D and 3D spaces are used. The method has been applied to an analysis of the complete genome sequences of the 2019 novel coronavirus. As a result, 2D and 3D classification maps are obtained. The coordinate axes in the maps correspond to the values derived from the exact formulas characterizing the graphs: the coordinates of the centers of mass and the 4D moments of inertia. The points in the maps represent sequences and their coordinates are used as the classifiers. The main result of this work has been derived from the 3D classification maps. The distribution of clusters of points that emerged in these maps supports the hypothesis that SARS-CoV-2 may have originated in bats and in pangolins. Pilot calculations for Zika virus sequence data indicate that the proposed approach is also applicable to describing the time evolution of viral genome sequences.
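The descriptors named in the abstract (centers of mass and moments of inertia of a set of points in 4D) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact formulas: the assignment of one unit axis per nucleotide in `STEP`, the unit mass per point, and the helper names `walk_points`, `center_of_mass`, and `diagonal_moments` are all hypothetical, and only the diagonal elements of the inertia tensor are computed.

```python
# Hypothetical axis assignment: one unit vector per nucleotide.
# The published method may use a different assignment.
STEP = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
STEP["U"] = STEP["T"]  # treat RNA uracil like thymine


def walk_points(seq):
    """Cumulative 4D walk: one unit-mass point per recognized nucleotide."""
    pos = [0, 0, 0, 0]
    points = []
    for base in seq.upper():
        step = STEP.get(base)
        if step is None:
            continue  # skip ambiguous symbols
        pos = [p + s for p, s in zip(pos, step)]
        points.append(tuple(pos))
    return points


def center_of_mass(points):
    """Mean coordinate of the points (all masses assumed equal to 1)."""
    if not points:
        raise ValueError("no points to characterize")
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(4))


def diagonal_moments(points):
    """Diagonal inertia elements I_jj = sum(|r|^2 - r_j^2), r from the center of mass."""
    com = center_of_mass(points)
    moments = [0.0] * 4
    for p in points:
        r = [p[j] - com[j] for j in range(4)]
        r2 = sum(x * x for x in r)
        for j in range(4):
            moments[j] += r2 - r[j] * r[j]
    return tuple(moments)
```

Plotting two or three of these values against each other for many sequences gives a map in the spirit of the classification maps described above.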
Affiliation(s)
- Dorota Bielińska-Wąż
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210, Gdańsk, Poland.
- Piotr Wąż
- Department of Nuclear Medicine, Medical University of Gdańsk, 80-210, Gdańsk, Poland.
6
Lan Y, He X, Li L, Zhou P, Huang X, Deng X, Li J, Fan Q, Li F, Tang X, Cai W, Hu F. Complicated genotypes circulating among treatment naïve HIV-1 patients in Guangzhou, China. Infection, Genetics and Evolution 2020; 87:104673. [PMID: 33309773 DOI: 10.1016/j.meegid.2020.104673] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/29/2020] [Revised: 12/03/2020] [Accepted: 12/07/2020] [Indexed: 01/07/2023]
Abstract
Guangzhou city is the political, economic, and cultural center of Guangdong Province, China. The molecular epidemiological characteristics of HIV-1 in Guangzhou are not widely known. The aim of this study was to explore the characteristics of HIV-1 genotypes among treatment-naïve HIV/AIDS patients living in Guangzhou. HIV-1 RNA was extracted from serum specimens. The partial pol gene of the HIV-1 genome was amplified and sequenced. The genotypes were screened using the subtyping tool COMET and further confirmed by phylogenetic analysis, with the exception of the URFs, which were analyzed by jpHMM and RIP. The distributions of HIV genotypes in different risk populations were analyzed. Subsequently, pol sequences were used to construct transmission networks and analyze drug resistance. Twelve HIV-1 genotypes, including 3 subtypes and 9 CRFs, along with several URFs, were identified from 1388 HIV-1 sequences, which were derived from 1490 patients. The main genotypes circulating in Guangzhou were CRF07_BC (38.3%), CRF01_AE (32.3%), and CRF55_01B (10.7%). CRF01_AE was the second most dominant strain, and multiple lineages of CRF01_AE were identified in Guangzhou. The 01B recombinant forms, including CRF55_01B, CRF59_01B and CRF68_01B, have circulated widely in Guangzhou. 42.22% (586/1388) of the study sequences fell into 143 transmission networks, and the three main clusters revealed that sequences from MSM and HET populations were intermixed. 5.40% (75/1388) of patients had pre-treatment drug resistance. The HIV-1 strains present in Guangzhou show complex genotypes. Particular attention should be paid to these genotypes in future strategies for the prevention of, and intervention in, HIV transmission.
Affiliation(s)
- Yun Lan
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Xiang He
- Guangdong Provincial Institute of Public Health, Guangdong Provincial Center for Disease Control and Prevention, 160 Qunxian Road, Panyu District, Guangzhou 511430, China
- Linghua Li
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Pingping Zhou
- Guangdong Provincial Institute of Public Health, Guangdong Provincial Center for Disease Control and Prevention, 160 Qunxian Road, Panyu District, Guangzhou 511430, China
- Xuhe Huang
- Guangdong Provincial Institute of Public Health, Guangdong Provincial Center for Disease Control and Prevention, 160 Qunxian Road, Panyu District, Guangzhou 511430, China
- Xizi Deng
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Junbin Li
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Qinghong Fan
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Feng Li
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Xiaoping Tang
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Weiping Cai
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
- Fengyu Hu
- Guangzhou Eighth People's Hospital, Guangzhou Medical University, 627 Dongfeng East Road, Yuexiu District, Guangzhou 510030, China
7
Pei S, Dong R, Bao Y, He RL, Yau SST. Classification of genomic components and prediction of genes of Begomovirus based on subsequence natural vector and support vector machine. PeerJ 2020; 8:e9625. [PMID: 32832270 PMCID: PMC7409808 DOI: 10.7717/peerj.9625] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/02/2020] [Accepted: 07/08/2020] [Indexed: 12/02/2022] Open
Abstract
Background: Begomoviruses are widely distributed and cause devastating diseases in many crops. According to the number of genomic components, a begomovirus is classed as either monopartite or bipartite. Both monopartite and bipartite begomoviruses have the DNA-A component, which encodes all essential proteins for virus functions, while bipartite begomoviruses additionally contain the DNA-B component. Satellite molecules, known as betasatellites, alphasatellites or deltasatellites, are sometimes associated with begomoviruses. The genomic components of begomoviruses are therefore complex and varied. Different genomic components have different gene structures and functions. Classifying the components of begomoviruses is important for studying the virus origin and pathogenic mechanism. Methods: We propose a model combining the Subsequence Natural Vector (SNV) method with the Support Vector Machine (SVM) algorithm to classify the genomic components of begomoviruses and predict their genes. First, the genome sequence is represented numerically as a vector by the SNV method. Then SVM is applied to the datasets to build the classification model. Finally, recursive feature elimination (RFE) is used to select essential features of the subsequence natural vectors based on feature importance. Results: In the investigation, DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate our model, the homology-based method BLAST and two machine learning algorithms, Random Forest and Naive Bayes, are compared with it. According to the results, our classification model can classify DNA-A, DNA-B, and different satellites with high accuracy. In particular, we can distinguish whether a DNA-A component is from a monopartite or a bipartite begomovirus. Then, based on the results of classification, we can also predict the genes of different genomic components. According to the selected features, we find that the content of the four nucleotides in the second and tenth segments (approximately 150–350 bp and 1,450–1,650 bp) differs most between DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and the transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.
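The step in which a genome sequence is "represented numerically as a vector" can be illustrated with a small sketch of a natural vector and a segment-wise variant. It follows the commonly cited natural vector definition (per-nucleotide count, mean position, and normalized second central moment); how the SNV method actually segments a sequence is not stated in the abstract, so the split into equal-length, non-overlapping segments and the names `natural_vector` and `subsequence_natural_vector` are assumptions made here.

```python
def natural_vector(seq, alphabet="ACGT"):
    """12-dimensional natural vector: counts, mean positions, second moments.

    Positions are 1-based. A nucleotide that does not occur contributes
    zeros for all three of its entries.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        # Second central moment, normalized by n_k * n.
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def subsequence_natural_vector(seq, segments=2):
    """Concatenate natural vectors of equal-length segments (assumed scheme)."""
    size = -(-len(seq) // segments)  # ceiling division
    vector = []
    for i in range(segments):
        vector.extend(natural_vector(seq[i * size:(i + 1) * size]))
    return vector
```

Because each segment contributes its own block of features, a feature selection step such as RFE can point to the segments whose nucleotide statistics matter most, which is the kind of finding reported in the Results.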
Affiliation(s)
- Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Yiming Bao
- National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, United States of America
- Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China