1
Thind AS, Sinha S. Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Curr Genomics 2023; 24:187-195. [PMID: 38178984 PMCID: PMC10761335 DOI: 10.2174/0113892029264990231013112156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/09/2023] [Accepted: 09/15/2023] [Indexed: 01/06/2024]
Abstract
Background Viruses have high mutation rates, facilitating rapid evolution and the emergence of new species, subspecies, strains and recombinant forms. Accurate classification of these forms is crucial for understanding viral evolution and developing therapeutic applications. Phylogenetic classification is typically performed by analyzing molecular differences at the genomic and sub-genomic levels. This involves aligning homologous proteins or genes. However, there is growing interest in developing alignment-free methods for whole-genome comparisons that are computationally efficient. Methods Here we elaborate on the Chaos Game Representation (CGR) method, based on concepts of statistical physics and free of sequence alignment assumptions. We adopt the CGR method for classification of the closely related clades/lineages A and B of the SARS-Corona virus 2019 (SARS-CoV-2), which is one of the fastest evolving viruses. Results Our study shows that the CGR approach can easily yield the SARS-CoV-2 phylogeny from the available whole genomes of lineage A and lineage B sequences. It also shows an accurate classification of eight different strains and the newly evolved XBB variant from its parental strains. Compared to alignment-based methods (Neighbour-Joining and Maximum Likelihood), the CGR method requires low computational resources, is fast and accurate for long sequences, and, being a K-mer based approach, allows simultaneous comparison of a large number of closely-related sequences of different sizes. Further, we developed an R pipeline CGRphylo, available on GitHub, which integrates the CGR module with various other R packages to create phylogenetic trees and visualize them. Conclusion Our findings demonstrate the efficacy of the CGR method for accurate classification and tracking of rapidly evolving viruses, offering valuable insights into the evolution and emergence of new SARS-CoV-2 strains and recombinants.
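The k-mer comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes relative k-mer frequencies compared with a Euclidean distance; the abstract does not state which distance CGRphylo uses, so the function names and the choice of distance here are hypothetical and are not the pipeline's implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over A, C, G, T.

    Windows containing any other symbol (e.g. N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences of different lengths (illustrative only, not real genomes).
a = kmer_frequencies("ACGTACGTACGT", 2)
b = kmer_frequencies("ACGTACGTAC", 2)
c = kmer_frequencies("AAAAAAAATTTT", 2)
print(euclidean(a, b), euclidean(a, c))
```

Because the vectors hold relative frequencies, sequences of different lengths map to vectors of the same dimension (4^k), which is what makes an alignment-free pairwise comparison possible.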
Affiliation(s)
- Amarinder Singh Thind
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
  - Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Australia
- Somdatta Sinha
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India

2
Huang HH, Girimurugan SB. Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Stat Appl Genet Mol Biol 2019; 18. [PMID: 30772870 DOI: 10.1515/sagmb-2018-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Affiliation(s)
- Hsin-Hsiung Huang
  - University of Central Florida, Department of Statistics, Orlando, FL, USA

3
Computational Techniques for a Comprehensive Understanding of Different Genotype-Phenotype Factors in Biological Systems and Their Applications. Synth Biol (Oxf) 2018. [DOI: 10.1007/978-981-10-8693-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

4
Huang HH, Girimurugan SB. A Novel Real-Time Genome Comparison Method Using Discrete Wavelet Transform. J Comput Biol 2017; 25:405-416. [PMID: 29272149 DOI: 10.1089/cmb.2017.0115] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Real-time genome comparison is important for identifying unknown species and clustering organisms. We propose a novel method that can represent genome sequences of different lengths as a 12-dimensional numerical vector in real time for this purpose. Given a genome sequence, a binary indicator sequence of each nucleotide base location is computed, and then discrete wavelet transform is applied to these four binary indicator sequences to attain the respective power spectra. Afterward, moments of the power spectra are calculated. Consequently, the 12-dimensional numerical vectors are constructed from the first three order moments. Our experimental results on various data sets show that the proposed method is efficient and effective to cluster genes and genomes. It runs significantly faster than other alignment-free and alignment-based methods.
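The steps named in this abstract (binary indicator sequences, a discrete wavelet transform, moments of the resulting power spectra) can be sketched roughly as below. This is only an illustration under stated assumptions: the Haar wavelet, zero-padding to a power-of-two length, squared coefficients standing in for the "power spectrum", and the moment normalisation are all choices made here, since the abstract does not specify them.

```python
from math import sqrt


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has `base`, else 0.0."""
    return [1.0 if s == base else 0.0 for s in seq.upper()]


def haar_dwt(signal):
    """Full orthonormal Haar decomposition.

    The input is zero-padded to a power-of-two length (an assumption).
    Output order: [approximation, coarsest details, ..., finest details].
    """
    n = 1
    while n < len(signal):
        n *= 2
    approx = list(signal) + [0.0] * (n - len(signal))
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / sqrt(2) for a, b in pairs]
    return approx + details


def dwt_feature_vector(seq):
    """12 numbers: three moments of the squared Haar coefficients per base.

    The moment definition (mean of power ** order) is hypothetical.
    """
    features = []
    for base in "ACGT":
        power = [c * c for c in haar_dwt(indicator(seq, base))]
        n = len(power)
        for order in (1, 2, 3):
            features.append(sum(p ** order for p in power) / n)
    return features


print(dwt_feature_vector("ACGTTGCAAC"))
```

Whatever the sequence length, the output has four bases times three moments, so genomes of different sizes land in the same 12-dimensional space.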
Affiliation(s)
- Hsin-Hsiung Huang
  - Department of Statistics, University of Central Florida, Orlando, Florida

5
Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016; 17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023]
Abstract
Background Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date. Results We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information. 
Conclusions Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.
Affiliation(s)
- Rallis Karamichalis
  - Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
- Lila Kari
  - School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
  - Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
- Stavros Konstantinidis
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
- Steffen Kopecki
  - Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
- Stephen Solis-Reyes
  - Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada

6
Hoang T, Yin C, Yau SST. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics 2016; 108:134-142. [PMID: 27538895 DOI: 10.1016/j.ygeno.2016.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/07/2016] [Revised: 08/04/2016] [Accepted: 08/12/2016] [Indexed: 11/19/2022]
Abstract
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane that allows the depiction of the DNA sequence in the form of image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
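The iterative mapping described in this abstract, with 2D CGR coordinates treated as complex numbers, can be sketched as follows. This is a minimal illustration and not the authors' MATLAB code: the corner layout (A, C, G, T on the unit square), the starting point at the centre, and the function names are assumptions.

```python
# Assumed corner layout on the unit square, written as complex numbers.
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_points(seq):
    """Map each nucleotide to a complex CGR coordinate.

    Starting from the centre of the square, each step moves halfway
    from the current point towards the corner of the next base.
    """
    z = 0.5 + 0.5j
    points = []
    for base in seq.upper():
        z = (z + CORNERS[base]) / 2
        points.append(z)
    return points


def decode(points):
    """Recover the bases from the quadrant in which each point falls.

    Reliable for short sequences; very long runs of one base can hit
    floating-point limits.
    """
    quadrant = {
        (False, False): "A",
        (False, True): "C",
        (True, True): "G",
        (True, False): "T",
    }
    return "".join(quadrant[(z.real >= 0.5, z.imag >= 0.5)] for z in points)


pts = cgr_points("GATTACA")
print(pts)
print(decode(pts))
```

The round trip through `decode` illustrates the one-to-one property the abstract emphasises: each point lies in the quadrant of the corner it was drawn towards, so the sequence can be read back from the coordinates.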
Affiliation(s)
- Tung Hoang
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China

7
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022]
Abstract
Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. 
Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
Affiliation(s)
- Rallis Karamichalis
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Lila Kari
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Stavros Konstantinidis
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada
- Steffen Kopecki
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada

8
Kari L, Hill KA, Sayem AS, Karamichalis R, Bryans N, Davis K, Dattani NS. Mapping the space of genomic signatures. PLoS One 2015; 10:e0119815. [PMID: 26000734 PMCID: PMC4441465 DOI: 10.1371/journal.pone.0119815] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 06/24/2014] [Accepted: 01/16/2015] [Indexed: 01/01/2023]
Abstract
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and that the sequence most different from it in this dataset belongs to a cucumber.
Affiliation(s)
- Lila Kari
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Kathleen A. Hill
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
  - Department of Biology, University of Western Ontario, London, Ontario, Canada
- Abu S. Sayem
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Rallis Karamichalis
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Nathaniel Bryans
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Katelyn Davis
  - Department of Biology, University of Western Ontario, London, Ontario, Canada
- Nikesh S. Dattani
  - Physical and Theoretical Chemistry Laboratory, Department of Chemistry, Oxford University, Oxford, United Kingdom

9
Hoang T, Yin C, Zheng H, Yu C, Lucy He R, Yau SST. A new method to cluster DNA sequences using Fourier power spectrum. J Theor Biol 2015; 372:135-45. [PMID: 25747773 PMCID: PMC7094126 DOI: 10.1016/j.jtbi.2015.02.026] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 09/12/2014] [Revised: 01/15/2015] [Accepted: 02/23/2015] [Indexed: 11/27/2022]
Abstract
A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods. We propose to use Fourier power spectrum to cluster genes and genomes. We construct mathematical moments from the power spectrum. We perform phylogenetic analysis of genes and genomes based on moments.
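The chain described in this abstract (indicator sequences, Discrete Fourier Transform, power spectra, moments) can be sketched in a few lines. The sketch below is illustrative only: it uses a direct O(n^2) DFT for clarity, and the moment normalisation (dividing by n to the power of the order) and the choice of the first three orders are assumptions, not the scaling used in the paper.

```python
import cmath


def indicator(seq, base):
    """Binary indicator sequence: 1 where seq has `base`, else 0."""
    return [1 if s == base else 0 for s in seq.upper()]


def power_spectrum(u):
    """Squared magnitudes of the DFT coefficients of u (direct evaluation)."""
    n = len(u)
    spectrum = []
    for k in range(n):
        coeff = sum(
            u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def moment_vector(seq, orders=(1, 2, 3)):
    """Moments of the four power spectra, concatenated into one vector.

    Hypothetical normalisation: sum(ps ** j) / n ** j for each order j.
    """
    features = []
    for base in "ACGT":
        ps = power_spectrum(indicator(seq, base))
        n = len(ps)
        for j in orders:
            features.append(sum(p ** j for p in ps) / n ** j)
    return features


print(moment_vector("ACGTAC"))
print(moment_vector("ACGTACGTTTGA"))
```

The two calls take sequences of different lengths but return vectors of the same size, which is the property the abstract highlights as the method's novelty.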
Affiliation(s)
- Tung Hoang
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Hui Zheng
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Chenglong Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China

10
What has molecular epidemiology ever done for wildlife disease research? Past contributions and future directions. Eur J Wildlife Res 2014. [DOI: 10.1007/s10344-014-0882-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]

11
A novel k-word relative measure for sequence comparison. Comput Biol Chem 2014; 53PB:331-338. [PMID: 25462340 DOI: 10.1016/j.compbiolchem.2014.10.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 04/26/2014] [Revised: 08/10/2014] [Accepted: 10/25/2014] [Indexed: 12/28/2022]
Abstract
In order to extract phylogenetic information from DNA sequences, the new normalized k-word average relative distance is proposed in this paper. The proposed measure was tested by discriminant analysis and phylogenetic analysis. The phylogenetic trees based on the Manhattan distance measure are reconstructed with k ranging from 1 to 12. At the same time, a new method is suggested to reduce the matrix dimension, which can greatly lessen the amount of calculation and operation time. The experimental assessment demonstrated that our measure was efficient. Moreover, comparison with other methods' results shows that our method is feasible and powerful for phylogenetic analysis.

12
Abstract
Dinucleotide usage is known to vary in the genomes of organisms. The dinucleotide usage profiles or genome signatures are similar for sequence samples taken from the same genome, but are different for taxonomically distant species. This concept of genome signatures has been used to study several organisms including viruses, to elucidate the signatures of evolutionary processes at the genome level. Genome signatures assume greater importance in the case of host-pathogen interactions, where molecular interactions between the two species take place continuously, and can influence their genomic composition. In this study, analyses of whole genome sequences of HIV-1 subtype B, a retrovirus that caused the global AIDS pandemic, have been carried out to analyse the variation in genome signatures of the virus from 1983 to 2007. We show statistically significant temporal variations in some dinucleotide patterns, highlighting the selective evolution of the dinucleotide profiles of HIV-1 subtype B, possibly a consequence of host-specific selection.
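A dinucleotide usage profile of the kind this abstract refers to can be computed as below. The sketch assumes the commonly used odds-ratio form (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies); the abstract does not give the exact formula used in the study, so treat the definition and the function name as assumptions.

```python
from collections import Counter


def dinucleotide_signature(seq):
    """Odds ratio f(xy) / (f(x) * f(y)) for all 16 dinucleotides.

    Values above 1 mean the dinucleotide is over-represented relative to
    what the base composition alone would predict; values below 1 mean
    it is under-represented. Requires a sequence of length >= 2.
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    signature = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            observed = di[x + y] / n_di
            signature[x + y] = observed / expected if expected else 0.0
    return signature


# Toy sequence (illustrative only, not an HIV-1 genome).
sig = dinucleotide_signature("ACGTACGTACGT")
print(sig["AC"], sig["AA"])
```

Comparing such 16-value profiles across genomes sampled in different years is one way to look for the kind of temporal variation the study reports.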

13
Shepard SS, Davis CT, Bahl J, Rivailler P, York IA, Donis RO. LABEL: fast and accurate lineage assignment with assessment of H5N1 and H9N2 influenza A hemagglutinins. PLoS One 2014; 9:e86921. [PMID: 24466291 PMCID: PMC3900692 DOI: 10.1371/journal.pone.0086921] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 08/16/2013] [Accepted: 12/16/2013] [Indexed: 11/30/2022]
Abstract
The evolutionary classification of influenza genes into lineages is a first step in understanding their molecular epidemiology and can inform the subsequent implementation of control measures. We introduce a novel approach called Lineage Assignment By Extended Learning (LABEL) to rapidly determine cladistic information for any number of genes without the need for time-consuming sequence alignment, phylogenetic tree construction, or manual annotation. Instead, LABEL relies on hidden Markov model profiles and support vector machine training to hierarchically classify gene sequences by their similarity to pre-defined lineages. We assessed LABEL by analyzing the annotated hemagglutinin genes of highly pathogenic (H5N1) and low pathogenicity (H9N2) avian influenza A viruses. Using the WHO/FAO/OIE H5N1 evolution working group nomenclature, the LABEL pipeline quickly and accurately identified the H5 lineages of uncharacterized sequences. Moreover, we developed an updated clade nomenclature for the H9 hemagglutinin gene and show a similarly fast and reliable phylogenetic assessment with LABEL. While this study was focused on hemagglutinin sequences, LABEL could be applied to the analysis of any gene and shows great potential to guide molecular epidemiology activities, accelerate database annotation, and provide a data sorting tool for other large-scale bioinformatic studies.
MESH Headings
- Animals
- Bayes Theorem
- Cell Lineage
- Chickens
- Evolution, Molecular
- Hemagglutinin Glycoproteins, Influenza Virus/analysis
- Hemagglutinin Glycoproteins, Influenza Virus/genetics
- Influenza A Virus, H5N1 Subtype/classification
- Influenza A Virus, H5N1 Subtype/genetics
- Influenza A Virus, H5N1 Subtype/pathogenicity
- Influenza A Virus, H9N2 Subtype/classification
- Influenza A Virus, H9N2 Subtype/genetics
- Influenza A Virus, H9N2 Subtype/pathogenicity
- Influenza in Birds/genetics
- Influenza in Birds/virology
- Phylogeny
- Poultry Diseases/virology
- Sequence Analysis, DNA
- Software
Affiliation(s)
- Samuel S. Shepard
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- C. Todd Davis
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Justin Bahl
  - Laboratory of Virus Evolution in Program of Emerging Infectious Diseases, Duke-NUS Graduate Medical School, Singapore, Singapore
  - Center for Infectious Diseases, The University of Texas School of Public Health, Houston, Texas, United States of America
- Pierre Rivailler
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Ian A. York
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Ruben O. Donis
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

14
Luo Y, Zhang Y, Liu X, Yang Y, Yang X, Zheng Z, Deng X, Wu X, Guo X. Characterization of a wild rabies virus isolate of porcine origin in China. Infection Genetics and Evolution 2013; 17:147-52. [PMID: 23567821 DOI: 10.1016/j.meegid.2013.03.046] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/16/2013] [Revised: 03/27/2013] [Accepted: 03/28/2013] [Indexed: 12/20/2022]
Abstract
Rabies virus (RABV) that circulates worldwide in a variety of mammals can cause fatal encephalomyelitis. GD-SH-01, a street rabies virus, was isolated from a rabid pig in China. We investigated the pathogenicity of GD-SH-01 in suckling and adult mice, and compared the susceptibility of NA and BHK-21 cells in the culture to infection by GD-SH-01 and CVS-24. The complete GD-SH-01 genome sequence was determined and compared with known RABV wild strains to understand the mutations and genetic diversity that allow RABV to spread and adapt in new hosts, such as pigs. Our results suggest that GD-SH-01 possesses the characteristics of a virulent strain in Southern China and shows higher pathogenicity index than that of CVS-24 regardless of its lower level of replication in mouse brain. Up to 47 unique nucleotide substitutions were found in the genome, including five missense mutations. These data provide useful information for further understanding the transmission mechanism and the genetic variation of RABV in dissimilar hosts.
Affiliation(s)
- Yongwen Luo
  - College of Veterinary Medicine, South China Agricultural University, Guangzhou, Guangdong 510642, China

15
Hecht L, Dormer A. The evolutionary significance of certain amino acid substitutions and their consequences for HIV-1 immunogenicity toward HLA's A*0201 and B*27. Bioinformation 2013; 9:315-20. [PMID: 23745018 PMCID: PMC3607191 DOI: 10.6026/97320630009315] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/01/2013] [Accepted: 01/29/2013] [Indexed: 11/23/2022]
Abstract
In silico tools are employed to examine the evolutionary relationship to possible vaccine peptide candidates' development. This perspective sheds light on the proteomic changes affecting the creation of HLA-specific T-cell stimulating peptide vaccines for HIV. Full-length sequences of the envelope protein of HIV subtypes A, B, C and D were obtained through the NCBI Protein database and aligned using CLUSTALW. They were then analyzed using RANKPEP specific to Human Leukocyte Antigen A*02 and B*27. Geneious was used to catalogue the collected gp160 sequences and to construct a phylogenetic tree. Mesquite was employed for ancestral state reconstruction to infer the order of amino acid substitutions in the epitopes examined. The results showed that the consensus peptide identified, SLAEKNITI, had changes indicating predicted escape mutations in strains of HIV responding to pressure exerted by CD8+ cells expressing HLA A*02. The predominating 9-mers IRIGPGQAF of gp120 are significantly less immunogenic toward HLA B*27 than toward HLA A*02. The data confirm previous findings on the importance, for efficacious binding, of an arginine residue at the 2nd position of the gag SL9 epitope, and extend this principle to other epitopes which interact with HLA B*27. This study shows that understanding viral evolution in relation to T-cell peptide vaccine design has much relevance for the creation of personalized therapeutics for HIV treatment.
Collapse
Affiliation(s)
- Luke Hecht
- Institute of Evolutionary Biology, The University of Edinburgh, Kings Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT
| | - Anton Dormer
- Washington Adventist University, Department of Biology, 7600 Flower Avenue Takoma Park, Maryland 20721
| |
Collapse
|
16
|
A simple k-word interval method for phylogenetic analysis of DNA sequences. J Theor Biol 2013; 317:192-9. [DOI: 10.1016/j.jtbi.2012.10.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2012] [Revised: 10/02/2012] [Accepted: 10/06/2012] [Indexed: 11/18/2022]
|
17
|
Dwivedi SK, Sengupta S. Classification of HIV-1 sequences using profile Hidden Markov Models. PLoS One 2012; 7:e36566. [PMID: 22623958 PMCID: PMC3356369 DOI: 10.1371/journal.pone.0036566] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/29/2011] [Accepted: 04/10/2012] [Indexed: 11/19/2022] Open
Abstract
Accurate classification of HIV-1 subtypes is essential for studying the dynamic spatial distribution pattern of HIV-1 subtypes and also for developing effective methods of treatment that can be targeted to attack specific subtypes. We propose a classification method based on profile Hidden Markov Model that can accurately identify an unknown strain. We show that a standard method that relies on the construction of a positive training set only, to capture unique features associated with a particular subtype, can accurately classify sequences belonging to all subtypes except B and D. We point out the drawbacks of the standard method; namely, an arbitrary choice of threshold to distinguish between true positives and true negatives, and the inability to discriminate between closely related subtypes. We then propose an improved classification method based on construction of a positive as well as a negative training set to improve discriminating ability between closely related subtypes like B and D. Finally, we show how the improved method can be used to accurately determine the subtype composition of Common Recombinant Forms of the virus that are made up of two or more subtypes. Our method provides a simple and highly accurate alternative to other classification methods and will be useful in accurately annotating newly sequenced HIV-1 strains.
Affiliation(s)
- Sanjiv K. Dwivedi
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
- School of Sciences, Indian Institute of Technology, Indore, Indore, India
- Supratim Sengupta
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
- Department of Physical Sciences, Indian Institute of Science Education and Research, Kolkata, Mohanpur, West Bengal, India
|
18
|
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis. Algorithms Mol Biol 2012; 7:10. [PMID: 22551152 PMCID: PMC3402988 DOI: 10.1186/1748-7188-7-10] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 01/25/2012] [Accepted: 05/02/2012] [Indexed: 01/06/2023] Open
Abstract
Background Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^(-L) of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
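As an illustration of the suffix property described in this abstract, the following minimal Python sketch iterates the CGR map and checks that two sequences sharing an L-long suffix end up within 2^(-L) of each other in each coordinate. The corner assignment in CORNERS, the starting point (0.5, 0.5), the function names and the toy sequences are assumptions made for the example; they are not taken from the cited paper.

```python
# Assumed corner layout; any fixed assignment of A, C, G, T to the
# four corners of the unit square gives the same property.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading seq left to right."""
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        # Each step moves halfway from the current point to the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def chebyshev(p, q):
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


# Two toy sequences with different prefixes and a shared 7-long suffix.
suffix = "GATTACA"
a = cgr_point("ACGTTGCA" + suffix)
b = cgr_point("TTTTCCCC" + suffix)

# The per-coordinate gap is bounded by 2^(-L) for a shared suffix of length L.
assert chebyshev(a, b) <= 2 ** -len(suffix)
print(a, b, chebyshev(a, b))
```

Because each step halves the distance to a corner, any difference accumulated before the shared suffix shrinks by a factor of 2 per shared symbol, which is where the 2^(-L) bound comes from.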
|
19
|
Hatje K, Kollmar M. A phylogenetic analysis of the Brassicales clade based on an alignment-free sequence comparison method. Front Plant Sci 2012; 3:192. [PMID: 22952468 PMCID: PMC3429886 DOI: 10.3389/fpls.2012.00192] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 05/25/2012] [Accepted: 08/06/2012] [Indexed: 05/06/2023]
Abstract
Phylogenetic analyses reveal the evolutionary derivation of species. A phylogenetic tree can be inferred from multiple sequence alignments of proteins or genes. The alignment of whole genome sequences of higher eukaryotes is a computationally intensive and ambitious task, as is the computation of phylogenetic trees based on these alignments. To overcome these limitations, we here used an alignment-free method to compare genomes of the Brassicales clade. For each nucleotide sequence a Chaos Game Representation (CGR) can be computed, which represents each nucleotide of the sequence as a point in a square defined by the four nucleotides as vertices. Each CGR is therefore a unique fingerprint of the underlying sequence. If the CGRs are divided by grid lines, each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence (Frequency Chaos Game Representation, FCGR). Here, we used distance measures between FCGRs to infer phylogenetic trees of Brassicales species. Three types of data were analyzed because of their different characteristics: (A) Whole genome assemblies as far as available for species belonging to the Malvidae taxon. (B) EST data of species of the Brassicales clade. (C) Mitochondrial genomes of the Rosids branch, a supergroup of the Malvidae. The trees reconstructed based on the Euclidean distance method are in general agreement with single gene trees. The Fitch-Margoliash and Neighbor joining algorithms resulted in similar to identical trees. Here, for the first time we have applied the bootstrap re-sampling concept to trees based on FCGRs to determine the support of the branchings. FCGRs have the advantage that they are fast to calculate, and can be used as additional information to alignment based data and morphological characteristics to improve the phylogenetic classification of species in ambiguous cases.
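The FCGR idea in this abstract can be sketched in a few lines of Python. Since each grid square of a CGR divided into 2^k by 2^k cells corresponds to one oligonucleotide of length k, the sketch below computes k-mer relative frequencies directly and compares them with the Euclidean distance. The function names, the choice k = 2, the normalisation to relative frequencies and the toy sequences are assumptions for illustration only; tree building (Fitch-Margoliash, Neighbor joining) and bootstrapping are not shown.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT, in a fixed order.

    Each entry plays the role of one FCGR grid square at resolution k.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equally long frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences: s1 and s2 differ in one base, s3 is unrelated.
toy = {
    "s1": "ACGTACGTACGTAAAC",
    "s2": "ACGTACGTACGTAAAG",
    "s3": "TTTTGGGGTTTTGGGG",
}
profiles = {name: kmer_frequencies(s, 2) for name, s in toy.items()}

d12 = euclidean(profiles["s1"], profiles["s2"])
d13 = euclidean(profiles["s1"], profiles["s3"])
print(d12, d13)
```

A matrix of such pairwise distances is the kind of input a distance-based tree method takes; the sketch stops at the distances themselves.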
Affiliation(s)
- Klas Hatje
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- Martin Kollmar
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- *Correspondence: Martin Kollmar, Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Am Fassberg 11, D-37077 Göttingen, Germany. e-mail:
|
20
|
Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol 2011; 62:756-63. [PMID: 22155711 DOI: 10.1016/j.ympev.2011.11.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 10/23/2010] [Revised: 10/29/2011] [Accepted: 11/18/2011] [Indexed: 10/14/2022]
Abstract
Pathogens like HIV-1, which evolve into many closely related variants displaying differential infectivity and evolutionary dynamics in a short time scale, require fast and accurate classification. Conventional whole genome sequence alignment-based methods are computationally expensive and involve complex analysis. Alignment-free methodologies are increasingly being used to effectively differentiate genomic variations between viral species. Multifractal analysis, which explores the self-similar nature of genomes, is an alignment-free methodology that has been applied to study such variations. However, whether multifractal analysis can quantify variations between closely related genomes, such as the HIV-1 subtypes, is an open question. Here we address the above by implementing the multifractal analysis on four retroviral genomes (HIV-1, HIV-2, SIVcpz, and HTLV-1), and demonstrate that individual multifractal properties can differentiate between different retrovirus types easily. However, the individual multifractal measures do not resolve within-group variations for different known subtypes of HIV-1 M group. We show here that these known subtypes can instead be classified correctly using a combination of the crucial multifractal measures. This method is simple and computationally fast in comparison to the conventional alignment-based methods for whole genome phylogenetic analysis.
Affiliation(s)
- Aridaman Pandit
- Mathematical Modeling and Computational Biology Group, Centre for Cellular and Molecular Biology (CSIR), Hyderabad 500007, India
|
21
|
Unterthiner T, Schultz AK, Bulla J, Morgenstern B, Stanke M, Bulla I. Detection of viral sequence fragments of HIV-1 subfamilies yet unknown. BMC Bioinformatics 2011; 12:93. [PMID: 21481263 PMCID: PMC3086866 DOI: 10.1186/1471-2105-12-93] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 09/28/2010] [Accepted: 04/11/2011] [Indexed: 11/25/2022] Open
Abstract
Background Methods of determining whether or not any particular HIV-1 sequence stems, completely or in part, from some unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, only a single algorithm, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not provide predicted boundaries of unknown fragments. Results We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model, which automatically determines which parts of an input sequence originate from a subtype yet unknown. The underlying model is based on a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated using the emission frequencies of the known subtypes by means of a (position-wise) probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences. The results have been compared with the corresponding results of the BI. Conclusions Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences stemming from yet unknown subtypes. Comparing USF with the BI shows that our algorithm performs as well as the BI or better.
Affiliation(s)
- Thomas Unterthiner
- Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
|
22
|
Whole-genome analysis of a human rabies virus from Sri Lanka. Arch Virol 2011; 156:659-69. [PMID: 21298456 DOI: 10.1007/s00705-010-0905-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 06/14/2010] [Accepted: 12/21/2010] [Indexed: 10/18/2022]
Abstract
The complete genome sequence of a human rabies virus, strain H-08-1320, from Sri Lanka was determined and compared with other rabies viruses. The size of the genome was 11,926 nt, and it was composed of a 58-nt 3' leader, five protein genes (N, 1353 nt; P, 894 nt; M, 609 nt; G, 1575 nt; L, 6387 nt) and a 70-nt 5' trailer. The intergenic region G-L contained 515 nt. The sizes of the nucleoprotein, phosphoprotein, matrix protein, glycoprotein and large protein were 450, 296, 202, 524 and 2,128 residues, respectively. The phosphoprotein and large protein were one amino acid shorter and longer, respectively, than those of most rabies viruses. The glycoprotein of H-08-1320 had a unique amino acid substitution at antigenic site I. Whole-genome phylogenetic analysis showed that strain H-08-1320 formed an independent lineage and did not cluster with rabies viruses from other countries.
|
23
|
Osmanbeyoglu HU, Ganapathiraju MK. N-gram analysis of 970 microbial organisms reveals presence of biological language models. BMC Bioinformatics 2011; 12:12. [PMID: 21219653 PMCID: PMC3027111 DOI: 10.1186/1471-2105-12-12] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 05/18/2010] [Accepted: 01/10/2011] [Indexed: 11/29/2022] Open
Abstract
Background It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, thereby limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree. Results We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity employing the Biological Language Modeling Toolkit and Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes with it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values with E. coli. Conclusion Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further, it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
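To make the notion of self- and cross-perplexity in this abstract concrete, here is a minimal Python sketch of an n-gram model over amino acid sequences. It is not the Biological Language Modeling Toolkit used in the cited study: the add-one smoothing, the bigram order (n = 2 rather than 4), the function names and the two toy sequences are all assumptions chosen to keep the example short.

```python
from collections import Counter
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_ngram(seq, n):
    """Count n-grams and their (n-1)-long contexts in a reference sequence."""
    grams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    contexts = Counter(seq[i:i + n - 1] for i in range(len(seq) - n + 1))
    return grams, contexts


def perplexity(model, seq, n):
    """Perplexity of seq under the model, with add-one smoothing (assumed)."""
    grams, contexts = model
    if len(seq) < n:
        raise ValueError("sequence shorter than n")
    vocab = len(ALPHABET)
    log_sum = 0.0
    count = 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # Smoothed conditional probability of the last symbol given its context.
        p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        log_sum += log2(p)
        count += 1
    return 2 ** (-log_sum / count)


# Hypothetical toy "proteomes"; real inputs would be whole proteome sequences.
reference = "MKVLAAGIVGLLLAQWMKVLAAGIVG"
other = "WWPPHHCCWWPPHHCC"

model = train_ngram(reference, 2)
self_pp = perplexity(model, reference, 2)
cross_pp = perplexity(model, other, 2)
print(self_pp, cross_pp)
```

In this toy setting the reference scores a lower perplexity against its own model than the unrelated sequence does, which mirrors the direction of the self- versus cross-perplexity comparison reported in the abstract.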
Affiliation(s)
- Hatice Ulku Osmanbeyoglu
- Department of Biomedical Informatics, University of Pittsburgh, 5150 Center Ave, Suite 301, Pittsburgh, PA 15232, USA
|