1
|
Halimeh FB, Rafei R, Osman M, Kassem II, Diene SM, Dabboussi F, Rolain JM, Hamze M. Historical, current, and emerging tools for identification and serotyping of Shigella. Braz J Microbiol 2021; 52:2043-2055. [PMID: 34524650 PMCID: PMC8441030 DOI: 10.1007/s42770-021-00573-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/28/2020] [Accepted: 06/29/2021] [Indexed: 11/17/2022]
Abstract
The Shigella genus includes serious foodborne disease etiologic agents, with 4 species and 54 serotypes. Identification at species and serotype levels is a crucial task in microbiological laboratories. Nevertheless, the genetic similarity between Shigella spp. and Escherichia coli challenges the correct identification and serotyping of Shigella spp., with subsequent negative repercussions on surveillance, epidemiological investigations, and selection of appropriate treatments. For this purpose, multiple techniques have been developed historically ranging from phenotype-based methods and single or multilocus molecular techniques to whole-genome sequencing (WGS). To facilitate the selection of the most relevant method, we herein provide a global overview of historical and emerging identification and serotyping techniques with a particular focus on the WGS-based approaches. This review highlights the excellent discriminatory power of WGS to more accurately elucidate the epidemiology of Shigella spp., disclose novel promising genomic targets for surveillance methods, and validate previous well-established methods.
Affiliation(s)
- Fatima Bachir Halimeh
  - Laboratoire Microbiologie Santé et Environnement (LMSE), Doctoral School of Sciences and Technology, Faculty of Public Health, Lebanese University, Tripoli, Lebanon
  - Aix-Marseille University, IRD, APHM, MEPHI, IHU-Méditerranée Infection, Faculté de Médecine Et de Pharmacie, 19-21 boulevard Jean Moulin, 13385, Marseille CEDEX 05, France
- Rayane Rafei
  - Laboratoire Microbiologie Santé et Environnement (LMSE), Doctoral School of Sciences and Technology, Faculty of Public Health, Lebanese University, Tripoli, Lebanon
- Marwan Osman
  - Laboratoire Microbiologie Santé et Environnement (LMSE), Doctoral School of Sciences and Technology, Faculty of Public Health, Lebanese University, Tripoli, Lebanon
  - Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14850, USA
- Issmat I Kassem
  - Center for Food Safety and Department of Food Science and Technology, University of Georgia, 1109 Experiment Street, Griffin, GA, 30223-1797, USA
- Seydina M Diene
  - Aix-Marseille University, IRD, APHM, MEPHI, IHU-Méditerranée Infection, Faculté de Médecine Et de Pharmacie, 19-21 boulevard Jean Moulin, 13385, Marseille CEDEX 05, France
- Fouad Dabboussi
  - Laboratoire Microbiologie Santé et Environnement (LMSE), Doctoral School of Sciences and Technology, Faculty of Public Health, Lebanese University, Tripoli, Lebanon
- Jean-Marc Rolain
  - Aix-Marseille University, IRD, APHM, MEPHI, IHU-Méditerranée Infection, Faculté de Médecine Et de Pharmacie, 19-21 boulevard Jean Moulin, 13385, Marseille CEDEX 05, France
- Monzer Hamze
  - Laboratoire Microbiologie Santé et Environnement (LMSE), Doctoral School of Sciences and Technology, Faculty of Public Health, Lebanese University, Tripoli, Lebanon
|
2
|
Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method. ENTROPY 2020; 22:e22020255. [PMID: 33286029 PMCID: PMC7516702 DOI: 10.3390/e22020255] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/17/2020] [Revised: 02/07/2020] [Accepted: 02/20/2020] [Indexed: 12/31/2022]
Abstract
HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson-Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.
|
3
|
Prabha R, Singh DP. Cyanobacterial phylogenetic analysis based on phylogenomics approaches render evolutionary diversification and adaptation: an overview of representative orders. 3 Biotech 2019; 9:87. [PMID: 30800598 DOI: 10.1007/s13205-019-1635-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/01/2018] [Accepted: 02/11/2019] [Indexed: 12/12/2022]
Abstract
Phylogenetic studies based on a definite set of marker genes usually reconstruct evolutionary relationships among the prokaryotic species. Based on specific target sequences, such studies represent variations and allow identification of similarities or dissimilarities in organisms. With the advent of completely sequenced genomes and accumulation of information on whole prokaryotic genomes, phylogenetic reconstructions should be considered more reliable if they are ideally based on entire genomes to resolve phylogenetic interest. We applied phylogenomics approaches taking into account completely sequenced cyanobacterial genomes to reconstruct underlying species that represented major taxonomic classes and belonged to distinctly different habitats (freshwater, marine, soils, and rocks). We did not rely on describing the phylogeny of all representative classes of cyanobacterial species on the basis of the ribosomal 16S rDNA gene alone. Instead, we analyzed combined molecular marker and phylogenomics approaches (genome alignment, gene content and gene order, composition vector and protein domain content) for accurately inferring the phylogenetic relationships of species. We have shown that this approach reflects the impact of evolution on the organisms and its connection with ecological adaptation in cyanobacteria in different habitats. Analysis revealed that the members from marine habitats occupy a different profile than those from freshwater. The impact of GC content and genomic repetitiveness on the diversification of cyanobacterial species and their possible role in adaptation was also reflected. Members occupying similar habitats cover more evolutionary distance together and also evolve various strategies for adaptation and survival either through genomic repetitiveness or preferences for genes of particular functions or modified GC content. Genomes undergo different changes for their adaptation in diverse habitats.
Affiliation(s)
- Ratna Prabha
  - ICAR-National Bureau of Agriculturally Important Microorganisms, Kushmaur, Maunath Bhanjan, 275101 India
  - Department of Biotechnology, Mewar University, Gangrar, Chittorgarh, Rajasthan India
- Dhananjaya P Singh
  - ICAR-National Bureau of Agriculturally Important Microorganisms, Kushmaur, Maunath Bhanjan, 275101 India
|
4
|
Pavan ME, Pavan EE, Glaeser SP, Etchebehere C, Kämpfer P, Pettinari MJ, López NI. Proposal for a new classification of a deep branching bacterial phylogenetic lineage: transfer of Coprothermobacter proteolyticus and Coprothermobacter platensis to Coprothermobacteraceae fam. nov., within Coprothermobacterales ord. nov., Coprothermobacteria classis nov. and Coprothermobacterota phyl. nov. and emended description of the family Thermodesulfobiaceae. Int J Syst Evol Microbiol 2018; 68:1627-1632. [PMID: 29595416 DOI: 10.1099/ijsem.0.002720] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Indexed: 01/08/2023]
Abstract
The genus Coprothermobacter (initially named Thermobacteroides) is currently placed within the phylum Firmicutes. Early 16S rRNA gene based phylogenetic studies pointed out the great differences between Coprothermobacter and other members of the Firmicutes, revealing that it constitutes a new deep branching lineage. Over the years, several studies based on 16S rRNA gene and whole genome sequences have indicated that Coprothermobacter is very distant phylogenetically to all other bacteria, supporting its placement in a distinct deeply rooted novel phylum. In view of this, we propose its allocation to the new family Coprothermobacteraceae within the novel order Coprothermobacterales, the new class Coprothermobacteria, and the new phylum Coprothermobacterota, and an emended description of the family Thermodesulfobiaceae.
Affiliation(s)
- María Elisa Pavan
  - Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Esteban E Pavan
  - Biomedical Technologies Laboratory, Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
- Stefanie P Glaeser
  - Institut für Angewandte Mikrobiologie, Universität Giessen, Giessen, Germany
- Claudia Etchebehere
  - Microbial Ecology Laboratory, Department of Biochemistry and Microbial Genetics, Biological Research Institute "Clemente Estable", Montevideo, Uruguay
- Peter Kämpfer
  - Institut für Angewandte Mikrobiologie, Universität Giessen, Giessen, Germany
- María Julia Pettinari
  - Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
  - IQUIBICEN-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Nancy I López
  - IQUIBICEN-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
  - Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
|
5
|
Höhl M, Rigoutsos I, Ragan MA. Pattern-Based Phylogenetic Distance Estimation and Tree Reconstruction. Evol Bioinform Online 2017. [DOI: 10.1177/117693430600200016] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 11/17/2022]
Abstract
We have developed an alignment-free method that calculates phylogenetic distances using a maximum-likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au ), we have created a data set of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms alignment-free methods and its phylogenetic accuracy is statistically indistinguishable from alignment-based distances.
Affiliation(s)
- Michael Höhl
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane QLD 4072, Australia
  - Australian Research Council Centre in Bioinformatics
- Isidore Rigoutsos
  - Australian Research Council Centre in Bioinformatics
  - Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, Yorktown Heights, NY 10598, U.S.A
- Mark A. Ragan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane QLD 4072, Australia
  - Australian Research Council Centre in Bioinformatics
|
6
|
Bogachev MI, Markelov OA, Kayumov AR, Bunde A. Superstatistical model of bacterial DNA architecture. Sci Rep 2017; 7:43034. [PMID: 28225058 PMCID: PMC5320525 DOI: 10.1038/srep43034] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 09/08/2016] [Accepted: 01/18/2017] [Indexed: 12/15/2022]
Abstract
Understanding the physical principles that govern the complex DNA structural organization as well as its mechanical and thermodynamical properties is essential for the advancement in both life sciences and genetic engineering. Recently we have discovered that the complex DNA organization is explicitly reflected in the arrangement of nucleotides depicted by the universal power law tailed internucleotide interval distribution that is valid for complete genomes of various prokaryotic and eukaryotic organisms. Here we suggest a superstatistical model that represents a long DNA molecule by a series of consecutive ~150 bp DNA segments with the alternation of the local nucleotide composition between segments exhibiting long-range correlations. We show that the superstatistical model and the corresponding DNA generation algorithm explicitly reproduce the laws governing the empirical nucleotide arrangement properties of the DNA sequences for various global GC contents and optimal living temperatures. Finally, we discuss the relevance of our model in terms of the DNA mechanical properties. As an outlook, we focus on finding the DNA sequences that encode a given protein while simultaneously reproducing the nucleotide arrangement laws observed from empirical genomes, that may be of interest in the optimization of genetic engineering of long DNA molecules.
Affiliation(s)
- Mikhail I. Bogachev
  - Biomedical Engineering Research Centre, St. Petersburg Electrotechnical University, St. Petersburg, 197376, Russia
  - Molecular Genetics of Microorganisms Lab, Institute of Fundamental Medicine and Biology, Kazan (Volga Region) Federal University, Kazan, Tatarstan, 420008, Russia
- Oleg A. Markelov
  - Biomedical Engineering Research Centre, St. Petersburg Electrotechnical University, St. Petersburg, 197376, Russia
- Airat R. Kayumov
  - Molecular Genetics of Microorganisms Lab, Institute of Fundamental Medicine and Biology, Kazan (Volga Region) Federal University, Kazan, Tatarstan, 420008, Russia
- Armin Bunde
  - Institut für Theoretische Physik, Justus-Liebig-Universität Giessen, 35392 Giessen, Germany
|
7
|
Chen S, Deng LY, Bowman D, Shiau JJH, Wong TY, Madahian B, Lu HHS. Phylogenetic tree construction using trinucleotide usage profile (TUP). BMC Bioinformatics 2016; 17:381. [PMID: 27766939 PMCID: PMC5073869 DOI: 10.1186/s12859-016-1222-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND It has been a challenging task to build a genome-wide phylogenetic tree for a large group of species containing a large number of genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution for all words of certain length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length (k) ranges from 6 to 15 and it may not be a multiple of 3 (codon length). The total number of possible words needed for FFP-k can range from 4^6 = 4096 to 4^15. RESULTS We propose a simple improvement over the popular FFP method using only a typical word length of 3. A new method, called Trinucleotide Usage Profile (TUP), is proposed based only on the (relative) frequency distribution using non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, which is much less than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each of the species by a TUP vector and then using an appropriate distance measure between pairs of the TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each of the reading frames. We also provide a numerical measure for comparing trees constructed with various methods. CONCLUSIONS Compared to the FFP method, our empirical study showed that the proposed TUP method is more capable of building phylogenetic trees with a stronger biological support. We further provide some justifications on this from the information theory viewpoint. Unlike the FFP method, the TUP method takes the advantage that the starting of the first reading frame is (usually) known. 
Without this information, the FFP method could only rely on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provides less accurate classification.
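The TUP representation described in this abstract can be illustrated with a minimal Python sketch. It builds the three-row matrix (one row per reading frame) of relative frequencies of non-overlapping words of length 3. The function names `tup_matrix` and `tup_distance` are hypothetical, and the Euclidean distance is only an assumed stand-in for the "appropriate distance measure", which the abstract does not specify.

```python
from itertools import product
from math import sqrt

# All 4^3 = 64 possible trinucleotides, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def tup_matrix(seq):
    """Return a 3 x 64 matrix: row f holds the relative frequencies of
    non-overlapping 3-letter words read in frame f (offset 0, 1 or 2)."""
    seq = seq.upper()
    rows = []
    for frame in range(3):
        counts = dict.fromkeys(CODONS, 0)
        # Step by 3 so that the windows do not overlap.
        for i in range(frame, len(seq) - 2, 3):
            word = seq[i:i + 3]
            if word in counts:  # skip words with ambiguous bases such as N
                counts[word] += 1
        total = sum(counts.values())
        rows.append([counts[c] / total if total else 0.0 for c in CODONS])
    return rows

def tup_distance(m1, m2):
    """Euclidean distance between two flattened TUP matrices
    (an assumed choice, not taken from the paper)."""
    return sqrt(sum((a - b) ** 2
                    for r1, r2 in zip(m1, m2)
                    for a, b in zip(r1, r2)))

m = tup_matrix("ATGATGATG")
print(m[0][CODONS.index("ATG")])  # frame 0 reads ATG, ATG, ATG
```

A pairwise distance matrix built this way could then be passed to any standard distance-based tree-building method.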
Affiliation(s)
- Si Chen
  - Key Laboratory of Combinatorial Biosynthesis and Drug Discovery Ministry of Education and School of Pharmaceutical Sciences Wuhan University, Wuhan, China
- Lih-Yuan Deng
  - Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
- Dale Bowman
  - Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
- Tit-Yee Wong
  - Department of Biological Sciences, University of Memphis, Memphis, TN, USA
- Behrouz Madahian
  - Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
|
8
|
Thankachan SV, Apostolico A, Aluru S. A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem. J Comput Biol 2016; 23:472-82. [PMID: 27058840 DOI: 10.1089/cmb.2015.0235] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/13/2022]
Abstract
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n^2) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common substring (ACSk) problem that takes O(n) space and O(n log^k n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
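To make the quantity concrete, here is a brute-force Python sketch of the k-mismatch average common substring value. It assumes the usual definition of the measure: for every position i of the first sequence, take the length of the longest substring starting at i that matches some substring of the second sequence with at most k mismatches, then average over i. This is the naive quadratic baseline, not the suffix-tree algorithm the article presents, and the names `longest_match` and `acs_k` are hypothetical.

```python
def longest_match(a, i, b, k):
    """Length of the longest substring of `a` starting at position i that
    matches a substring of `b` with at most k mismatches (Hamming)."""
    best = 0
    for j in range(len(b)):
        mismatches = 0
        length = 0
        while i + length < len(a) and j + length < len(b):
            if a[i + length] != b[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best

def acs_k(a, b, k=0):
    """Average of longest_match over all positions of `a`.
    With k = 0 this is the exact-match ACS value."""
    if not a:
        return 0.0
    return sum(longest_match(a, i, b, k) for i in range(len(a))) / len(a)

print(acs_k("ACGT", "CG"))       # exact matches only
print(acs_k("ACGT", "CG", k=1))  # one mismatch allowed
```

The value is not symmetric in its two arguments, so a distance derived from it is normally symmetrized and normalized by sequence length; that step is omitted here.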
Affiliation(s)
- Alberto Apostolico
  - College of Computing, Georgia Institute of Technology, Atlanta, Georgia
- Srinivas Aluru
  - College of Computing, Georgia Institute of Technology, Atlanta, Georgia
|
9
|
Aurell E, Innocenti N, Zhou HJ. The bulk and the tail of minimal absent words in genome sequences. Phys Biol 2016; 13:026004. [PMID: 27043075 DOI: 10.1088/1478-3975/13/2/026004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Minimal absent words (MAWs) of a genomic sequence are words that are themselves absent from the sequence but all of whose subwords are present in it. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the concept of a core of a MAW, which are sequences present in the genome and closest to a given MAW. We show that in E. faecalis, E. coli and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs. We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
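The definition of a minimal absent word can be checked directly on small inputs. The Python sketch below enumerates MAWs up to a chosen length by brute force. It relies on the fact that the set of present words is closed under taking subwords, so a candidate is minimal as soon as its two longest proper subwords (the word without its first letter and without its last letter) are present. The function name `minimal_absent_words` and the `alphabet` and `max_len` parameters are illustrative assumptions; this is not the method used in the paper and would not scale to whole genomes.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Return the sorted list of minimal absent words of `seq`
    with length at most `max_len`, over the given alphabet."""
    # Every word of length <= max_len that occurs in seq, plus the empty word.
    present = {""}
    for i in range(len(seq)):
        for j in range(i + 1, min(i + max_len, len(seq)) + 1):
            present.add(seq[i:j])

    maws = set()
    for length in range(1, max_len + 1):
        # Extend each present word of length - 1 by one letter, so the
        # prefix of every candidate is present by construction.
        for prefix in [w for w in present if len(w) == length - 1]:
            for letter in alphabet:
                candidate = prefix + letter
                # Absent itself, but its longest proper suffix is present.
                if candidate not in present and candidate[1:] in present:
                    maws.add(candidate)
    return sorted(maws)

print(minimal_absent_words("AAB", max_len=3, alphabet="AB"))
```

For the toy sequence `AAB` over the alphabet {A, B}, the word `ABA` is absent but not minimal, because its subword `BA` is already absent.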
Affiliation(s)
- Erik Aurell
  - Department of Computational Biology, KTH Royal Institute of Technology, AlbaNova University Center, SE-10691 Stockholm, Sweden
  - Department of Information and Computer Science, Aalto University, FI-02150 Espoo, Finland
|
10
|
Zuo G, Hao B. CVTree3 Web Server for Whole-genome-based and Alignment-free Prokaryotic Phylogeny and Taxonomy. GENOMICS, PROTEOMICS & BIOINFORMATICS 2015; 13:321-31. [PMID: 26563468 PMCID: PMC4678791 DOI: 10.1016/j.gpb.2015.08.004] [Citation(s) in RCA: 146] [Impact Index Per Article: 16.2] [Received: 07/22/2015] [Accepted: 08/10/2015] [Indexed: 01/15/2023]
Abstract
A faithful phylogeny and an objective taxonomy for prokaryotes should agree with each other and ultimately follow the genome data. With the number of sequenced genomes reaching tens of thousands, both tree inference and detailed comparison with taxonomy are great challenges. We now provide one solution in the latest Release 3.0 of the alignment-free and whole-genome-based web server CVTree3. The server resides in a cluster of 64 cores and is equipped with an interactive, collapsible, and expandable tree display. It is capable of comparing the tree branching order with prokaryotic classification at all taxonomic ranks from domains down to species and strains. CVTree3 allows for inquiry by taxon names and trial on lineage modifications. In addition, it reports a summary of monophyletic and non-monophyletic taxa at all ranks as well as produces print-quality subtree figures. After giving an overview of retrospective verification of the CVTree approach, the power of the new server is described for the mega-classification of prokaryotes and determination of taxonomic placement of some newly-sequenced genomes. A few discrepancies between CVTree and 16S rRNA analyses are also summarized with regard to possible taxonomic revisions. CVTree3 is freely accessible to all users at http://tlife.fudan.edu.cn/cvtree3/ without login requirements.
Affiliation(s)
- Guanghong Zuo
  - T-Life Research Center, Department of Physics, Fudan University, Shanghai 200433, China
- Bailin Hao
  - T-Life Research Center, Department of Physics, Fudan University, Shanghai 200433, China
|
11
|
Zuo G, Xu Z, Hao B. Phylogeny and Taxonomy of Archaea: A Comparison of the Whole-Genome-Based CVTree Approach with 16S rRNA Sequence Analysis. Life (Basel) 2015; 5:949-68. [PMID: 25789552 PMCID: PMC4390887 DOI: 10.3390/life5010949] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 12/09/2014] [Revised: 03/06/2015] [Accepted: 03/09/2015] [Indexed: 11/29/2022]
Abstract
A tripartite comparison of Archaea phylogeny and taxonomy at and above the rank order is reported: (1) the whole-genome-based and alignment-free CVTree using 179 genomes; (2) the 16S rRNA analysis exemplified by the All-Species Living Tree with 366 archaeal sequences; and (3) the Second Edition of Bergey's Manual of Systematic Bacteriology complemented by some current literature. A high degree of agreement is reached at these ranks. From the newly proposed archaeal phyla, Korarchaeota, Thaumarchaeota, Nanoarchaeota and Aigarchaeota, to the recent suggestion to divide the class Halobacteria into three orders, all gain substantial support from CVTree. In addition, the CVTree helped to determine the taxonomic position of some newly sequenced genomes without proper lineage information. A few discrepancies between the CVTree and the 16S rRNA approaches call for further investigation.
Affiliation(s)
- Guanghong Zuo
  - Life Research Center and Department of Physics, Fudan University, 220 Handan Road, Shanghai 200433, China
- Zhao Xu
  - Thermo Fisher Scientific, 200 Oyster Point Blvd, South San Francisco, CA 94080, USA
- Bailin Hao
  - Life Research Center and Department of Physics, Fudan University, 220 Handan Road, Shanghai 200433, China
|
12
|
|
13
|
Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 2014; 15:890-905. [PMID: 23904502 PMCID: PMC4296134 DOI: 10.1093/bib/bbt052] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Received: 04/11/2013] [Accepted: 05/31/2013] [Indexed: 12/17/2022]
Abstract
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
|
14
|
New layers in understanding and predicting α-linolenic acid content in plants using amino acid characteristics of omega-3 fatty acid desaturase. Comput Biol Med 2014; 54:14-23. [DOI: 10.1016/j.compbiomed.2014.08.019] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/17/2014] [Revised: 08/16/2014] [Accepted: 08/17/2014] [Indexed: 12/11/2022]
|
15
|
A novel alignment-free method for whole genome analysis: Application to HIV-1 subtyping and HEV genotyping. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.04.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
|
16
|
Zuo G, Li Q, Hao B. On K-peptide length in composition vector phylogeny of prokaryotes. Comput Biol Chem 2014; 53 Pt A:166-73. [PMID: 25205031 DOI: 10.1016/j.compbiolchem.2014.08.021] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Accepted: 07/11/2014] [Indexed: 11/25/2022]
Abstract
Using an enlarged alphabet of K-tuples is the way to carry out alignment-free comparison of genomes in the composition vector (CV) approach to prokaryotic phylogeny. We summarize the known aspects concerning the choice of K and examine the results of using CVs with subtraction of a statistical background for K=3-9 and using raw CVs without subtraction for K=1-12. The criterion for evaluation consists in direct comparison with taxonomy. For prokaryotes the best performances are obtained for K=5 and 6 with subtraction and for K=11, 12 or even more without subtraction. In general, CVs with subtractions are slightly better and less CPU consuming, but CVs without subtraction may provide complementary information.
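The abstract does not spell out what "subtraction of a statistical background" means, so the following Python sketch rests on an assumption: the background frequency of a K-tuple is predicted from its (K-1)- and (K-2)-tuple frequencies with a Markov-type estimate, p0(a1...aK) = p(a1...aK-1) p(a2...aK) / p(a2...aK-1), and the vector component is the relative deviation (p - p0) / p0, as commonly described for the CV approach. The distance (1 - C) / 2, with C the cosine of the angle between two vectors, is likewise assumed. All function names are hypothetical, and the toy input is a plain string, whereas the paper works with K-peptides from whole proteomes.

```python
from collections import Counter
from math import sqrt

def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-tuples in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n))
    return {w: c / n for w, c in counts.items()}

def composition_vector(seq, k):
    """Background-subtracted composition vector for k >= 3 (assumed formula)."""
    p_k, p_k1, p_k2 = (kmer_freqs(seq, n) for n in (k, k - 1, k - 2))
    alphabet = sorted(set(seq))
    cv = {}
    for prefix in p_k1:
        for letter in alphabet:
            word = prefix + letter
            if word[1:] not in p_k1:
                continue  # predicted frequency would be zero; skip
            p0 = p_k1[word[:-1]] * p_k1[word[1:]] / p_k2[word[1:-1]]
            cv[word] = (p_k.get(word, 0.0) - p0) / p0
    return cv

def cv_distance(u, v):
    """(1 - cosine similarity) / 2 between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return 0.5 if norm == 0 else (1 - dot / norm) / 2

cv = composition_vector("ACGTCA", 3)
print(cv["CAC"])  # CAC is predicted by the background but never observed
```

Words that are predicted by the background but never observed receive the component -1, which is why the vector is built from candidate words rather than from observed K-tuples only.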
Affiliation(s)
- Guanghong Zuo
  - T-Life Research Center, Fudan University, Shanghai 200433, China
- Qiang Li
  - CAS-MPG Partner Institute for Computational Biology, Shanghai 200032, China
- Bailin Hao
  - T-Life Research Center, Fudan University, Shanghai 200433, China
|
17
|
Ebrahimi M, Aghagolzadeh P, Shamabadi N, Tahmasebi A, Alsharifi M, Adelson DL, Hemmatzadeh F, Ebrahimie E. Understanding the undelaying mechanism of HA-subtyping in the level of physic-chemical characteristics of protein. PLoS One 2014; 9:e96984. [PMID: 24809455 PMCID: PMC4014573 DOI: 10.1371/journal.pone.0096984] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 04/01/2013] [Accepted: 04/07/2014] [Indexed: 01/05/2023]
Abstract
The evolution of the influenza A virus to increase its host range is a major concern worldwide. Molecular mechanisms of increasing host range are largely unknown. Influenza surface proteins play determining roles in reorganization of host-sialic acid receptors and host range. In an attempt to uncover the physicochemical attributes which govern HA subtyping, we performed a large-scale functional analysis of over 7000 sequences of 16 different HA subtypes. A large number (896) of physicochemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine learning models which can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on the calculated protein characteristics dataset as well as 10 trimmed datasets generated by attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), percentage/frequency of Tyr, percentage of Cys, and frequencies of Try and Glu (selected by 70% of attribute weighting algorithms) as the key features that are associated with HA subtyping. Random Forest tree induction algorithm and RBF kernel function of SVM (scaled by grid search) showed high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in monitoring the short mutation/reassortment paths by which influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption. 
Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represent a new avenue for understanding and predicting possible future structure of influenza pandemics.
Affiliation(s)
- Mansour Ebrahimi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran
- Parisa Aghagolzadeh
- Department of Nephrology, Hypertension, and Clinical Pharmacology, University of Bern, Bern, Switzerland
- Narges Shamabadi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran
- Mohammed Alsharifi
- School of Molecular and Biomedical Science, The University of Adelaide, Adelaide, Australia
- David L. Adelson
- School of Molecular and Biomedical Science, The University of Adelaide, Adelaide, Australia
- Farhid Hemmatzadeh
- School of Animal and Veterinary Science, The University of Adelaide, Adelaide, Australia
- * E-mail: (FH); (EE)
- Esmaeil Ebrahimie
- School of Molecular and Biomedical Science, The University of Adelaide, Adelaide, Australia
- * E-mail: (FH); (EE)
18
Yuan J, Zhu Q, Liu B. Phylogenetic and biological significance of evolutionary elements from metazoan mitochondrial genomes. PLoS One 2014; 9:e84330. [PMID: 24465405 PMCID: PMC3896360 DOI: 10.1371/journal.pone.0084330] [Received: 03/26/2013] [Accepted: 11/14/2013]
Abstract
The evolutionary history of living species is usually inferred through the phylogenetic analysis of molecular and morphological information using various mathematical models. New challenges in phylogenetic analysis are centered mostly on the search for accurate and efficient methods to handle the huge amounts of sequence data generated from newer genome sequencing. The next major challenge is the determination of relationships between the evolution of structural elements and their functional implementation, which has been largely ignored in previous analyses. Here, we describe the discovery of structural elements in metazoan mitochondrial genomes, termed key K-strings, that can serve as a basis for phylogenetic tree construction. Although comprising only a small fraction (0.73%) of all K-strings, these key K-strings are pivotal to the tree construction because they allow for a significant reduction in the computational time required to construct phylogenetic trees, and more importantly, they significantly improve the results of phylogenetic inference. The trees constructed from the key K-strings were consistent overall with our current view of metazoan phylogeny and exhibited a more rational topology than the trees constructed by using other conventional methods. Surprisingly, the key K-strings tended to accumulate in the conserved regions of the original sequences, most likely due to strong selection pressure. Furthermore, the special structural features of the key K-strings should have some potential applications in the study of the structure-function relationship of proteins and in the determination of the evolutionary trajectory of species. The novelty and potential importance of key K-strings lead us to believe that they are essential evolutionary elements. As such, they may play important roles in the process of species evolution and their physical existence. Further studies could lead to discoveries regarding the relationship between evolution and processes of speciation.
Affiliation(s)
- Jianbo Yuan
- Center of Systematic Genomics, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang, China
- CAS Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Bin Liu
- Center of Systematic Genomics, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang, China
- CAS Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
19
Dai Q, Yan Z, Shi Z, Liu X, Yao Y, He P. Study of LZ-word distribution and its application for sequence comparison. J Theor Biol 2013; 336:52-60. [PMID: 23876763 PMCID: PMC7094135 DOI: 10.1016/j.jtbi.2013.07.008] [Received: 04/12/2013] [Revised: 07/06/2013] [Accepted: 07/10/2013]
Abstract
Lempel-Ziv complexity has been widely used for sequence comparison and achieved promising results, but until now components' distribution in exhaustive history has not been studied. This paper investigated the whole distribution of LZ-words and presented a novel statistical method for sequence comparison. With the components' length in mind, we revised Lempel-Ziv complexity and obtained various sets of LZ-words. Instead of calculating the LZ-words' contents, we defined a series of set operations on LZ-word set to compare biological sequences. In order to assess the effectiveness of the proposed method, we performed two sets of experiments and compared it with alignment-based methods.
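As a rough illustration of the idea in this abstract, the sketch below parses a sequence into the components of its Lempel-Ziv exhaustive history and then compares two sequences through their LZ-word sets. The parsing rule is the standard Lempel-Ziv (1976) one. The function names and the Jaccard-style distance are assumptions chosen for illustration; the abstract does not state which set operations the authors actually define.

```python
def lz_words(seq):
    """Split seq into the components of its Lempel-Ziv exhaustive history.

    Each component is the shortest substring starting at position i that
    cannot be copied from the text seen so far (seq[:j-1]).
    """
    words = []
    i, n = 0, len(seq)
    while i < n:
        j = i + 1
        # Extend the candidate while it is still reproducible from the prefix.
        while j <= n and seq[i:j] in seq[:j - 1]:
            j += 1
        words.append(seq[i:min(j, n)])
        i = j
    return words


def lz_word_set_distance(a, b):
    """Hypothetical set-based dissimilarity: 1 - |A & B| / |A | B|."""
    wa, wb = set(lz_words(a)), set(lz_words(b))
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    print(lz_words("AACGTACCATTG"))
    print(lz_word_set_distance("AACGTACCATTG", "AACGTACCTTTG"))
```

The number of components returned by `lz_words` is the classical Lempel-Ziv complexity; keeping the words themselves, rather than only their count, is what allows a set-based comparison.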
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
20
Zuo G, Xu Z, Hao B. Shigella strains are not clones of Escherichia coli but sister species in the genus Escherichia. Genomics, Proteomics & Bioinformatics 2012; 11:61-5. [PMID: 23395177 PMCID: PMC4357666 DOI: 10.1016/j.gpb.2012.11.002] [Received: 10/28/2012] [Accepted: 11/05/2012]
Abstract
Shigella species and Escherichia coli are closely related organisms. Early phenotyping experiments and several recent molecular studies put Shigella within the species E. coli. However, the whole-genome-based, alignment-free and parameter-free CVTree approach shows convincingly that four established Shigella species, Shigella boydii, Shigella sonnei, Shigella flexneri and Shigella dysenteriae, are distinct from E. coli strains, and form sister species to E. coli within the genus Escherichia. In view of the overall success and high resolution power of the CVTree approach, this result should be taken seriously. We hope that the present report may promote further in-depth study of the Shigella-E. coli relationship.
Affiliation(s)
- Guanghong Zuo
- T-Life Research Center and Department of Physics, Fudan University, Shanghai 200433, China
21
Castellini A, Franco G, Manca V. A dictionary based informational genome analysis. BMC Genomics 2012; 13:485. [PMID: 22985068 PMCID: PMC3577435 DOI: 10.1186/1471-2164-13-485] [Received: 01/09/2012] [Accepted: 08/28/2012]
Abstract
Background In the post-genomic era several methods of computational genomics are emerging to understand how the whole information is structured within genomes. Literature of last five years accounts for several alignment-free methods, arisen as alternative metrics for dissimilarity of biological sequences. Among the others, recent approaches are based on empirical frequencies of DNA k-mers in whole genomes. Results Any set of words (factors) occurring in a genome provides a genomic dictionary. About sixty genomes were analyzed by means of informational indexes based on genomic dictionaries, where a systemic view replaces a local sequence analysis. A software prototype applying a methodology here outlined carried out some computations on genomic data. We computed informational indexes, built the genomic dictionaries with different sizes, along with frequency distributions. The software performed three main tasks: computation of informational indexes, storage of these in a database, index analysis and visualization. The validation was done by investigating genomes of various organisms. A systematic analysis of genomic repeats of several lengths, which is of vivid interest in biology (for example to compute excessively represented functional sequences, such as promoters), was discussed, and suggested a method to define synthetic genetic networks. Conclusions We introduced a methodology based on dictionaries, and an efficient motif-finding software application for comparative genomics. This approach could be extended along many investigation lines, namely exported in other contexts of computational genomics, as a basis for discrimination of genomic pathologies.
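The notion of a genomic dictionary can be sketched in a few lines: collect every k-mer occurring in a genome together with its frequency, then derive simple indexes from that table. The code below is a minimal sketch, not the authors' software. The function names are hypothetical, and the Shannon entropy of the k-mer distribution is used here only as one plausible example of an informational index.

```python
import math
from collections import Counter


def kmer_dictionary(genome, k):
    """Return a Counter mapping each k-mer in genome to its number of occurrences."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_entropy(counts):
    """Shannon entropy (in bits) of the empirical k-mer frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeated_words(counts):
    """Words of the dictionary that occur more than once (genomic repeats)."""
    return {word for word, c in counts.items() if c > 1}


if __name__ == "__main__":
    genome = "ACGTACGT"  # toy sequence, not real data
    for k in (1, 2, 3):
        d = kmer_dictionary(genome, k)
        print(k, len(d), round(kmer_entropy(d), 3), sorted(repeated_words(d)))
```

Running the same functions for several values of k gives dictionaries of different sizes and their frequency distributions, which is the kind of table the abstract describes storing and visualizing.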
Affiliation(s)
- Alberto Castellini
- Department of Computer Science, Strada Le Grazie 15, 37134 Verona, Italy
22
Gao B, Gupta RS. Phylogenetic framework and molecular signatures for the main clades of the phylum Actinobacteria. Microbiol Mol Biol Rev 2012; 76:66-112. [PMID: 22390973 PMCID: PMC3294427 DOI: 10.1128/mmbr.05011-11]
Abstract
The phylum Actinobacteria harbors many important human pathogens and also provides one of the richest sources of natural products, including numerous antibiotics and other compounds of biotechnological interest. Thus, a reliable phylogeny of this large phylum and the means to accurately identify its different constituent groups are of much interest. Detailed phylogenetic and comparative analyses of >150 actinobacterial genomes reported here form the basis for achieving these objectives. In phylogenetic trees based upon 35 conserved proteins, most of the main groups of Actinobacteria as well as a number of their suprageneric clades are resolved. We also describe large numbers of molecular markers consisting of conserved signature indels in protein sequences and whole proteins that are specific for either all Actinobacteria or their different clades (viz., orders, families, genera, and subgenera) at various taxonomic levels. These signatures independently support the existence of different phylogenetic clades, and based upon them, it is now possible to delimit the phylum Actinobacteria (excluding Coriobacteriia) and most of its major groups in clear molecular terms. The species distribution patterns of these markers also provide important information regarding the interrelationships among different main orders of Actinobacteria. The identified molecular markers, in addition to enabling the development of a stable and reliable phylogenetic framework for this phylum, also provide novel and powerful means for the identification of different groups of Actinobacteria in diverse environments. Genetic and biochemical studies on these Actinobacteria-specific markers should lead to the discovery of novel biochemical and/or other properties that are unique to different groups of Actinobacteria.
Affiliation(s)
- Beile Gao
- Department of Biochemistry and Biomedical Science, McMaster University, Hamilton, Ontario, Canada
23
Integrating overlapping structures and background information of words significantly improves biological sequence comparison. PLoS One 2011; 6:e26779. [PMID: 22102867 PMCID: PMC3213098 DOI: 10.1371/journal.pone.0026779] [Received: 05/10/2011] [Accepted: 10/04/2011]
Abstract
Word-based models have achieved promising results in sequence comparison. However, as the important statistical properties of words in biological sequence, how to use the overlapping structures and background information of the words to improve sequence comparison is still a problem. This paper proposed a new statistical method that integrates the overlapping structures and the background information of the words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were taken to test the proposed model. The first one, performed via receiver operating curve analysis, is the application of proposed method in discrimination between functionally related regulatory sequences and unrelated sequences, intron and exon. The second experiment is to evaluate the performance of the proposed method with f-measure for clustering Hepatitis E virus genotypes. It was demonstrated that the proposed method integrating the overlapping structures and the background information of words significantly improves biological sequence comparison and outperforms the existing models.
24
Liu X, Zhao YP. Substitution matrices of residue triplets derived from protein blocks. J Comput Biol 2011; 17:1679-87. [PMID: 21128854 DOI: 10.1089/cmb.2008.0035]
Abstract
In protein sequence alignment, residue similarity is usually evaluated by substitution matrix, which scores all possible exchanges of one amino acid with another. Several matrices are widely used in sequence alignment, including PAM matrices derived from homologous sequence and BLOSUM matrices derived from aligned segments of BLOCKS. However, most matrices have not addressed the high-order residue-residue interactions that are vital to the bio-properties of protein. With consideration for the inherent correlation in residue triplet, we present a new scoring scheme for sequence alignment. Protein sequence is treated as overlapping and successive 3-residue segments. Two edge residues of a triplet are clustered into hydrophobic or polar categories, respectively. Protein sequence is then rewritten into triplet sequence with 2 x 20 x 2 = 80 alphabets. Using a traditional approach, we construct a new scoring scheme named TLESUM(hp) (TripLEt SUbstitution Matrices with hydrophobic and polar information) for pairwise substitution of triplets, which characterizes the similarity of residue triplets. The applications of this matrix led to marked improvements in multiple sequence alignment and in searching structurally alike residue segments. The reason for the occurrence of the "twilight zone," i.e., structure explosion of low identity sequences, is also discussed.
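The recoding step described in this abstract can be illustrated with a short sketch: each overlapping 3-residue window keeps its central residue and reduces the two edge residues to a hydrophobic or polar class, giving 2 x 20 x 2 = 80 possible symbols. The hydrophobic residue set and the function names below are assumptions made for illustration. The abstract does not specify how the authors partition the amino acids, and the construction of the TLESUM scores themselves is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed hydrophobic class; the paper's actual partition may differ.
HYDROPHOBIC = frozenset("ACFILMVW")


def edge_class(residue):
    """Map an edge residue to 'h' (hydrophobic) or 'p' (polar)."""
    return "h" if residue in HYDROPHOBIC else "p"


def to_triplet_sequence(protein):
    """Rewrite a protein as overlapping triplet symbols such as 'hKp'."""
    return [
        edge_class(protein[i]) + protein[i + 1] + edge_class(protein[i + 2])
        for i in range(len(protein) - 2)
    ]


def triplet_alphabet():
    """All 2 x 20 x 2 symbols of the reduced triplet alphabet."""
    return [a + mid + b for a in "hp" for mid in AMINO_ACIDS for b in "hp"]


if __name__ == "__main__":
    print(to_triplet_sequence("MKVLA"))
    print(len(triplet_alphabet()))
```

A substitution matrix over this alphabet would then have 80 x 80 entries, compared with 20 x 20 for single-residue matrices such as PAM or BLOSUM.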
Affiliation(s)
- Xin Liu
- State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics, Chinese Academy of Sciences, Beijing, China
25
Using Markov model to improve word normalization algorithm for biological sequence comparison. Amino Acids 2011; 42:1867-77. [DOI: 10.1007/s00726-011-0906-2] [Received: 11/13/2010] [Accepted: 03/29/2011]
26
Chang G, Wang T. Weighted relative entropy for alignment-free sequence comparison based on Markov model. J Biomol Struct Dyn 2011; 28:545-55. [PMID: 21142223 DOI: 10.1080/07391102.2011.10508594]
Abstract
In this paper, we introduce a probabilistic measure for computing the similarity between two biological sequences without alignment. The computation of the similarity measure is based on the Kullback-Leibler divergence of two constructed Markov models. We firstly validate the method on clustering nine chromosomes from three species. Secondly, we give the result of similarity search based on our new method. We lastly apply the measure to the construction of phylogenetic tree of 48 HEV genome sequences. Our results indicate that the weighted relative entropy is an efficient and powerful alignment-free measure for the analysis of sequences in the genomic scale.
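A minimal sketch of the general idea follows: fit a Markov chain to each sequence, then sum the Kullback-Leibler divergences between the two sets of transition probabilities, weighting each context by how often it occurs. The model order, the add-one pseudocounts, the choice of weights, and the symmetrization are all assumptions made for this sketch; the abstract does not give the authors' exact definition of the weighted relative entropy.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def markov_model(seq, order=1, pseudo=1.0):
    """Estimate context weights and transition probabilities from seq."""
    counts = {
        "".join(ctx): {a: pseudo for a in ALPHABET}
        for ctx in product(ALPHABET, repeat=order)
    }
    for i in range(len(seq) - order):
        ctx, nxt = seq[i:i + order], seq[i + order]
        if ctx in counts and nxt in counts[ctx]:  # skip ambiguous symbols
            counts[ctx][nxt] += 1
    total = sum(sum(row.values()) for row in counts.values())
    weights = {ctx: sum(row.values()) / total for ctx, row in counts.items()}
    trans = {
        ctx: {a: c / sum(row.values()) for a, c in row.items()}
        for ctx, row in counts.items()
    }
    return weights, trans


def weighted_relative_entropy(p, q):
    """Context-weighted KL divergence of model p from model q (in nats)."""
    weights, tp = p
    _, tq = q
    return sum(
        weights[ctx]
        * sum(tp[ctx][a] * math.log(tp[ctx][a] / tq[ctx][a]) for a in ALPHABET)
        for ctx in tp
    )


def symmetric_distance(s1, s2, order=1):
    """Average of the two directed divergences, so that d(s1, s2) == d(s2, s1)."""
    p, q = markov_model(s1, order), markov_model(s2, order)
    return 0.5 * (weighted_relative_entropy(p, q) + weighted_relative_entropy(q, p))


if __name__ == "__main__":
    print(symmetric_distance("ACGTACGTACGT", "AAAAAAAACCCC"))
```

The pseudocounts keep every transition probability strictly positive, which avoids undefined logarithms when a transition is absent from one of the sequences. A pairwise matrix of such distances could then be passed to any standard tree-building or clustering routine.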
Affiliation(s)
- Guisong Chang
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
27
Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol 2011; 276:174-80. [PMID: 21334347 DOI: 10.1016/j.jtbi.2011.02.005] [Received: 12/20/2010] [Revised: 02/05/2011] [Accepted: 02/07/2011]
Abstract
Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studied numerical characteristics of word frequencies and proposed a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it with two experiments and further compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure extracting more information on the overlapping structures of the words improves the efficiency of sequence comparison.
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
28
Liu X, Dai Q, Li L, He Z. An efficient binomial model-based measure for sequence comparison and its application. J Biomol Struct Dyn 2011; 28:833-43. [PMID: 21294594 DOI: 10.1080/07391102.2011.10508611]
Abstract
Sequence comparison is one of the major tasks in bioinformatics, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations. There are several similarity/dissimilarity measures for sequence comparison, but challenges remain. This paper presented a binomial model-based measure to analyze biological sequences. With the help of a random indicator, the occurrence of a word at any position of a sequence can be regarded as a random Bernoulli variable, and the distribution of the sum of the word occurrences is well known to be binomial. By using a recursive formula, we computed the binomial probability of the word count and proposed a binomial model-based measure based on the relative entropy. The proposed measure was tested by extensive experiments including classification of HEV genotypes and phylogenetic analysis, and further compared with alignment-based and alignment-free measures. The results demonstrate that the proposed measure based on the binomial model is more efficient.
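The binomial step can be sketched as follows. A word of length k has n = L - k + 1 possible start positions in a sequence of length L, and its count is modeled as Binomial(n, p). The probabilities are obtained with the usual recurrence P(j+1) = P(j) * (n - j) / (j + 1) * p / (1 - p), starting from P(0) = (1 - p)^n. Estimating p as a product of single-letter frequencies (an independent-letter background) is an assumption of this sketch, and the relative-entropy measure the authors build on top of these probabilities is not reproduced.

```python
import math
from collections import Counter


def binomial_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p) via recurrence."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    probs = [(1.0 - p) ** n]  # may underflow for very long sequences
    ratio = p / (1.0 - p)
    for j in range(n):
        probs.append(probs[-1] * (n - j) / (j + 1) * ratio)
    return probs


def word_count_probability(seq, word):
    """Observed count of word in seq and its binomial probability.

    Assumes an independent-letter background: p is the product of the
    single-letter frequencies of the word's characters (illustrative choice).
    """
    k = len(word)
    n = len(seq) - k + 1
    letter_counts = Counter(seq)
    p = math.prod(letter_counts[ch] / len(seq) for ch in word)
    count = sum(1 for i in range(n) if seq[i:i + k] == word)
    return count, binomial_pmf(n, p)[count]


if __name__ == "__main__":
    print(binomial_pmf(4, 0.5))
    print(word_count_probability("ACGTACGT", "AC"))
```

Because overlapping occurrences of a word are not truly independent, the binomial model is an approximation; the sketch keeps that simplification.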
Affiliation(s)
- Xiaoqing Liu
- School of Science, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
29
Zuo G, Xu Z, Yu H, Hao B. Jackknife and bootstrap tests of the composition vector trees. Genomics, Proteomics & Bioinformatics 2010; 8:262-7. [PMID: 21382595 PMCID: PMC5054193 DOI: 10.1016/s1672-0229(10)60028-9]
Abstract
Composition vector trees (CVTrees) are inferred from whole-genome data by an alignment-free and parameter-free method. The agreement of these trees with the corresponding taxonomy provides an objective justification of the inferred phylogeny. In this work, we show the stability and self-consistency of CVTrees by performing bootstrap and jackknife re-sampling tests adapted to this alignment-free approach. Our ultimate goal is to advocate the viewpoint that time-consuming statistical re-sampling tests can be avoided altogether when using this alignment-free approach. Agreement with taxonomy should be taken as a major criterion to estimate prokaryotic phylogenetic trees.
Affiliation(s)
- Guanghong Zuo
- T-Life Research Center & Department of Physics, Fudan University, Shanghai 200433, China
- Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai 201800, China
- Zhao Xu
- T-Life Research Center & Department of Physics, Fudan University, Shanghai 200433, China
- Applied Biosystems, Inc., Beijing 100027, China
- Hongjie Yu
- T-Life Research Center & Department of Physics, Fudan University, Shanghai 200433, China
- Fudan-VARI Center for Genetic Epidemiology, Fudan University, Shanghai 200433, China
- Bailin Hao
- T-Life Research Center & Department of Physics, Fudan University, Shanghai 200433, China
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Santa Fe Institute, Santa Fe, NM 87505, USA
30
Sun J, Xu Z, Hao B. Whole-genome based Archaea phylogeny and taxonomy: A composition vector approach. Chinese Science Bulletin 2010; 55:2323-2328. [PMID: 32214732 PMCID: PMC7089326 DOI: 10.1007/s11434-010-3008-8] [Received: 04/23/2009] [Accepted: 08/13/2009]
Abstract
The newly proposed alignment-free and parameter-free composition vector (CVTree) method has been successfully applied to infer phylogenetic relationships of viruses, chloroplasts, bacteria, and fungi from their whole-genome data. In this study we pay special attention to the phylogenetic positions of 56 Archaea genomes, among which 7 species have not been listed either in Bergey's Manual of Systematic Bacteriology or in the Taxonomic Outline of Bacteria and Archaea (TOBA). By inspecting the stable monophyletic branchings in CVTrees reconstructed from a total of 861 genomes (56 Archaea plus 797 Bacteria, using 8 Eukarya as outgroups), definite taxonomic assignments were proposed for these not-fully-classified species. Further development of Archaea taxonomy may verify the predicted phylogenetic results of the CVTree approach.
Affiliation(s)
- JianDong Sun
- T-Life Research Center & Department of Physics, Fudan University, Shanghai, 200433 China
- Zhao Xu
- T-Life Research Center & Department of Physics, Fudan University, Shanghai, 200433 China
- BaiLin Hao
- T-Life Research Center & Department of Physics, Fudan University, Shanghai, 200433 China
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, 100190 China
- Santa Fe Institute, Santa Fe, New Mexico, 87501 USA
31
Corel E, Pitschi F, Laprevotte I, Grasseau G, Didier G, Devauchelle C. MS4--Multi-Scale Selector of Sequence Signatures: an alignment-free method for classification of biological sequences. BMC Bioinformatics 2010; 11:406. [PMID: 20673356 PMCID: PMC2923138 DOI: 10.1186/1471-2105-11-406] [Received: 10/20/2009] [Accepted: 07/30/2010]
Abstract
BACKGROUND While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. RESULTS Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). CONCLUSIONS The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter kappa of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available.
Affiliation(s)
- Eduardo Corel
- Georg-August-Universität, Institut für Mikrobiologie und Genetik, Göttingen, Germany
32
Apostolico A. Maximal Words in Sequence Comparisons Based on Subword Composition. Algorithms and Applications 2010. [DOI: 10.1007/978-3-642-12476-1_2]
33
Wang H, Xu Z, Gao L, Hao B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 2009; 9:195. [PMID: 19664262 PMCID: PMC3087519 DOI: 10.1186/1471-2148-9-195] [Received: 09/30/2008] [Accepted: 08/10/2009]
Abstract
BACKGROUND Molecular phylogenetics and phylogenomics have greatly revised and enriched the fungal systematics in the last two decades. Most of the analyses have been performed by comparing single or multiple orthologous gene regions. Sequence alignment has always been an essential element in tree construction. These alignment-based methods (to be called the standard methods hereafter) need independent verification in order to put the fungal Tree of Life (TOL) on a secure footing. The ever-increasing number of sequenced fungal genomes and the recent success of our newly proposed alignment-free composition vector tree (CVTree, see Methods) approach have made the verification feasible. RESULTS In all, 82 fungal genomes covering 5 phyla were obtained from the relevant genome sequencing centers. An unscaled phylogenetic tree with 3 outgroup species was constructed by using the CVTree method. Overall, the resultant phylogeny infers all major groups in accordance with standard methods. Furthermore, the CVTree provides information on the placement of several currently unsettled groups. Within the sub-phylum Pezizomycotina, our phylogeny places the Dothideomycetes and Eurotiomycetes as sister taxa. Within the Sordariomycetes, it infers that Magnaporthe grisea and the Plectosphaerellaceae are closely related to the Sordariales and Hypocreales, respectively. Within the Eurotiales, it supports that Aspergillus nidulans is the early-branching species among the 8 aspergilli. Within the Onygenales, it groups Histoplasma and Paracoccidioides together, supporting that the Ajellomycetaceae is a distinct clade from Onygenaceae. Within the sub-phylum Saccharomycotina, the CVTree clearly resolves two clades: (1) species that translate CTG as serine instead of leucine (the CTG clade) and (2) species that have undergone whole-genome duplication (the WGD clade). It places Candida glabrata at the base of the WGD clade. CONCLUSION Using different input data and methodology, the CVTree approach is a good complement to the standard methods. The remarkable consistency between them has brought about more confidence to the current understanding of the fungal branch of TOL.
Affiliation(s)
- Hao Wang
- T-life Research Center, Department of Physics, Fudan University, Shanghai 200433, PR China.
34
Xu Z, Hao B. CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes. Nucleic Acids Res 2009; 37:W174-8. [PMID: 19398429 PMCID: PMC2703908 DOI: 10.1093/nar/gkp278] [Received: 01/30/2009] [Revised: 04/10/2009] [Accepted: 04/14/2009]
Abstract
The CVTree web server (http://tlife.fudan.edu.cn/cvtree) presented here is a new implementation of the whole genome-based, alignment-free composition vector (CV) method for phylogenetic analysis. It is more efficient and user-friendly than the previously published version in the 2004 web server issue of Nucleic Acids Research. The development of whole genome-based alignment-free CV method has provided an independent verification to the traditional phylogenetic analysis based on a single gene or a few genes. This new implementation attempts to meet the challenge of ever increasing amount of genome data and includes in its database more than 850 prokaryotic genomes which will be updated monthly from NCBI, and more than 80 fungal genomes collected manually from several sequencing centers. This new CVTree web server provides a faster and stable research platform. Users can upload their own sequences to find their phylogenetic position among genomes selected from the server's inbuilt database. All sequence data used in a session may be downloaded as a compressed file. In addition to standard phylogenetic trees, users can also choose to output trees whose monophyletic branches are collapsed to various taxonomic levels. This feature is particularly useful for comparing phylogeny with taxonomy when dealing with thousands of genomes.
Affiliation(s)
- Zhao Xu
- T-Life Research Center, Fudan University, 220 Handan Road, Shanghai 200433, China.
35
Apostolico A, Denas O. Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol Biol 2008; 3:13. [PMID: 18957094 PMCID: PMC2615014 DOI: 10.1186/1748-7188-3-13] [Received: 06/23/2008] [Accepted: 10/28/2008]
Abstract
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.
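To make the idea of a composition-based distance concrete, the following is a minimal sketch, not the method of the paper above. It assumes a single fixed word length k and plain relative frequencies, whereas the abstract describes a comparison over all subwords with frequencies adjusted for the statistical background. The function names `kmer_frequencies` and `cosine_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of all overlapping words of length k in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def cosine_distance(seq_a: str, seq_b: str, k: int = 3) -> float:
    """1 - cosine similarity of the two k-mer frequency vectors.

    Assumption: no background correction is applied, unlike the
    adjusted frequencies described in the abstract.
    """
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    # Only words present in both sequences contribute to the dot product.
    dot = sum(fa[w] * fb.get(w, 0.0) for w in fa)
    norm_a = sqrt(sum(v * v for v in fa.values()))
    norm_b = sqrt(sum(v * v for v in fb.values()))
    return 1.0 - dot / (norm_a * norm_b)
```

Identical sequences give a distance of approximately zero, and sequences that share no word of length k give a distance of 1. No alignment is computed at any point.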
Affiliation(s)
- Alberto Apostolico
- Academia Nazionale dei Lincei, Rome, Italy
- Department of Information Engineering, Universitá di Padova, Padova, Italy
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
- Olgert Denas
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
36
Liu Z, DeSantis TZ, Andersen GL, Knight R. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res 2008; 36:e120. [PMID: 18723574 PMCID: PMC2566877 DOI: 10.1093/nar/gkn491]
Abstract
The recent introduction of massively parallel pyrosequencers allows rapid, inexpensive analysis of microbial community composition using 16S ribosomal RNA (rRNA) sequences. However, a major challenge is to design a workflow so that taxonomic information can be accurately and rapidly assigned to each read, so that the composition of each community can be linked back to likely ecological roles played by members of each species, genus, family or phylum. Here, we use three large 16S rRNA datasets to test whether taxonomic information based on the full-length sequences can be recaptured by short reads that simulate the pyrosequencer outputs. We find that different taxonomic assignment methods vary radically in their ability to recapture the taxonomic information in full-length 16S rRNA sequences: most methods are sensitive to the region of the 16S rRNA gene that is targeted for sequencing, but many combinations of methods and rRNA regions produce consistent and accurate results. To process large datasets of partial 16S rRNA sequences obtained from surveys of various microbial communities, including those from human body habitats, we recommend the use of Greengenes or RDP classifier with fragments of at least 250 bases, starting from one of the primers R357, R534, R798, F343 or F517.
Affiliation(s)
- Zongzhi Liu
- Department of Chemistry and Biochemistry, UCB 215, University of Colorado at Boulder, Boulder, CO 80309-0215, USA
37
Abstract
BACKGROUND: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences. RESULTS: We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods. CONCLUSION: We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.
Affiliation(s)
- Guoqing Lu
- Department of Biology, University of Nebraska, Omaha, NE 68182, USA.
38
Gao L, Qi J, Sun J, Hao B. Prokaryote phylogeny meets taxonomy: an exhaustive comparison of composition vector trees with systematic bacteriology. ACTA ACUST UNITED AC 2008; 50:587-99. [PMID: 17879055 DOI: 10.1007/s11427-007-0084-3] [Received: 04/16/2007] [Accepted: 07/21/2007]
Abstract
We perform an exhaustive, taxon by taxon, comparison of the branchings in the composition vector trees (CVTrees) inferred from 432 prokaryotic genomes available on 31 December 2006, with the bacteriologists' taxonomy--primarily the latest online Outline of the Bergey's Manual of Systematic Bacteriology. The CVTree phylogeny agrees very well with the Bergey's taxonomy in majority of fine branchings and overall structures. At the same time most of the differences between the trees and the Manual have been known to biologists to some extent and may hint at taxonomic revisions. Instead of demonstrating the overwhelming agreement this paper puts emphasis on the biological implications of the differences.
Affiliation(s)
- Lei Gao
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100080, China
39
Guo FB, Yu XJ. Separate base usages of genes located on the leading and lagging strands in Chlamydia muridarum revealed by the Z curve method. BMC Genomics 2007; 8:366. [PMID: 17925038 PMCID: PMC2089121 DOI: 10.1186/1471-2164-8-366] [Received: 04/11/2007] [Accepted: 10/10/2007]
Abstract
Background: The nucleotide compositional asymmetry between the leading and lagging strands in bacterial genomes has been the subject of intensive study in the past few years. It is interesting to mention that almost all bacterial genomes exhibit the same kind of base asymmetry. This work aims to investigate the strand biases in the Chlamydia muridarum genome and show the potential of the Z curve method for quantitatively differentiating genes on the leading and lagging strands. Results: The occurrence frequencies of bases of protein-coding genes in the C. muridarum genome were analyzed by the Z curve method. It was found that genes located on the two strands of replication have distinct base usages in the C. muridarum genome. According to their positions in the 9-D space spanned by the variables u1 – u9 of the Z curve method, the K-means clustering algorithm can assign about 94% of genes to the correct strands, which is a few percent higher than those correctly classified by K-means based on the RSCU. The base usage and codon usage analyses show that genes on the leading strand have more G than C and more T than A, particularly at the third codon position. For genes on the lagging strand the biases are reversed. The y component of the Z curves for the complete chromosome sequences shows that the excesses of G over C and T over A are more remarkable in the C. muridarum genome than in other bacterial genomes without separating base and/or codon usages. Furthermore, for the genomes of Borrelia burgdorferi, Treponema pallidum, Chlamydia muridarum and Chlamydia trachomatis, in which distinct base and/or codon usages have been observed, closer phylogenetic distance is found compared with other bacterial genomes. Conclusion: The nature of the strand biases of base composition in C. muridarum is similar to that in most other bacterial genomes. However, the base composition asymmetry between the leading and lagging strands in C. muridarum is more significant than that in other bacteria. It is supposed that the remarkable strand biases of G/C and T/A are responsible for the appearance of separate base or codon usages in C. muridarum. On the other hand, the closer phylogenetic distance among the four bacterial genomes with separate base and/or codon usages is necessary rather than occasional. It is also shown that the Z curve method may be more sensitive than RSCU when being used to quantitatively analyze DNA sequences.
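For readers unfamiliar with the Z curve, the sketch below computes the basic cumulative form of the curve, in which each prefix of a sequence maps to a point (x, y, z) built from running base counts. It is an illustration under stated assumptions and not the 9-D u1 to u9 variant used in the study above: the function name `z_curve` is hypothetical, and characters other than A, C, G and T are simply skipped.

```python
def z_curve(seq: str) -> list[tuple[int, int, int]]:
    """Cumulative Z-curve points (x_n, y_n, z_n), one per counted base.

    x_n = (A + G) - (C + T)   purine vs. pyrimidine
    y_n = (A + C) - (G + T)   amino vs. keto
    z_n = (A + T) - (G + C)   weak vs. strong hydrogen bonding
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            continue  # assumption: ambiguous symbols such as N are ignored
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)))
    return points
```

Because y can be rewritten as (A - T) + (C - G), a y component that falls along a strand corresponds to an excess of G over C and of T over A, which is the kind of bias the abstract reports for the leading strand.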
Affiliation(s)
- Feng-Biao Guo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
40
Abstract
The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. 
Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
Affiliation(s)
- Michael Höhl
- Australian Research Council Centre in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
- Mark A. Ragan
- Australian Research Council Centre in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
41
Li J, Sayood K. A genome signature based on Markov modeling. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2005:2832-5. [PMID: 17282832 DOI: 10.1109/iembs.2005.1617063]
Abstract
We propose a "genome signature" for bacterial genomes based on a triplets Markov model. Without the alignment or data preprocessing required by traditional analysis methods, the model is shown to efficiently capture identifying genomic information at genus, species and strain levels. Based on the model, a simple distance measure is proposed for constructing phylogeny trees. Unlike other genome signatures based on word frequency with problems balancing word length and window size, the method has been shown to work successfully with both bacterial whole genome data and individual eukaryotic genes. Applications of the model to phylogenetic analysis and sequence fragment identification are presented.
Affiliation(s)
- Jian Li
- Department of Electrical Engineering, University of Nebraska-Lincoln, NE 68588, USA.
42
Shen J, Zhang S, Lee HC, Hao B. SeeDNA: a visualization tool for K-string content of long DNA sequences and their randomized counterparts. Genomics Proteomics Bioinformatics 2005; 2:192-6. [PMID: 15862120 PMCID: PMC5172470 DOI: 10.1016/s1672-0229(04)02025-x]
Abstract
An interactive tool to visualize the K-string composition of long DNA sequences including bacterial complete genomes is described. It is especially useful for exploring short palindromic structures in the sequences. The SeeDNA program runs on Red Hat Linux with GTK+ support. It displays two-dimensional (2D) or one-dimensional (1D) histograms of the K-string distribution of a given sequence and/or its randomized counterpart. It is also capable of showing the difference of K-string distributions between two sequences. The C source code using the GTK+ package is freely available.
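The kind of data behind such a K-string histogram can be sketched as follows. This is not the SeeDNA program (which is C with GTK+); it is a small illustration with hypothetical function names, and the randomized counterpart here is assumed to be a plain shuffle that preserves only the single-base composition.

```python
import random
from collections import Counter


def kstring_counts(seq: str, k: int) -> Counter:
    """Count every overlapping K-string (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffled(seq: str, seed: int = 0) -> str:
    """Randomized counterpart: same bases, order shuffled (assumed scheme)."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


def count_difference(seq_a: str, seq_b: str, k: int) -> dict[str, int]:
    """Per-K-string count difference (a minus b) over all observed K-strings."""
    ca, cb = kstring_counts(seq_a, k), kstring_counts(seq_b, k)
    return {w: ca[w] - cb[w] for w in sorted(set(ca) | set(cb))}
```

Calling `count_difference(seq, shuffled(seq), k)` yields the values one would plot to see which K-strings are over- or under-represented relative to the shuffled sequence.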
Affiliation(s)
- Junjie Shen
- Department of Computer Science, Zhejiang University, Hangzhou 310027, China
- Shuyu Zhang
- T-Life Research Center, Fudan University, Shanghai 200433, China
- Hoong-Chien Lee
- Department of Physics, National Central University, Chungli, Taiwan 320, China
- Bailin Hao
- T-Life Research Center, Fudan University, Shanghai 200433, China
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100080, China
- Corresponding author.
43
Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res 2004; 32:W45-7. [PMID: 15215347 PMCID: PMC441500 DOI: 10.1093/nar/gkh362] [Received: 02/13/2004] [Revised: 03/03/2004] [Accepted: 03/03/2004]
Abstract
Composition Vector Tree (CVTree) implements a systematic method of inferring evolutionary relatedness of microbial organisms from the oligopeptide content of their complete proteomes (http://cvtree.cbi.pku.edu.cn). Since the first bacterial genomes were sequenced in 1995 there have been several attempts to infer prokaryote phylogeny from complete genomes. Most of them depend on sequence alignment directly or indirectly and, in some cases, need fine-tuning and adjustment. The composition vector method circumvents the ambiguity of choosing the genes for phylogenetic reconstruction and avoids the necessity of aligning sequences of essentially different length and gene content. This new method does not contain 'free' parameter and 'fine-tuning'. A bootstrap test for a phylogenetic tree of 139 organisms has shown the stability of the branchings, which support the small subunit ribosomal RNA (SSU rRNA) tree of life in its overall structure and in many details. It may provide a quick reference in prokaryote phylogenetics whenever the proteome of an organism is available, a situation that will become commonplace in the near future.
Affiliation(s)
- Ji Qi
- The Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China.
44
Yeganova L, Smith L, Wilbur WJ. Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 2004; 28:97-107. [PMID: 15130538 PMCID: PMC5815558 DOI: 10.1016/j.compbiolchem.2003.12.003] [Received: 11/04/2003] [Revised: 12/11/2003] [Accepted: 12/12/2003]
Abstract
Gene and protein names follow few, if any, true naming conventions and are subject to great variation in different occurrences of the same name. This gives rise to two important problems in natural language processing. First, can one locate the names of genes or proteins in free text, and second, can one determine when two names denote the same gene or protein? The first of these problems is a special case of the problem of named entity recognition, while the second is a special case of the problem of automatic term recognition (ATR). We study the second problem, that of gene or protein name variation. Here we describe a system which, given a query gene or protein name, identifies related gene or protein names in a large list. The system is based on a dynamic programming algorithm for sequence alignment in which the mutation matrix is allowed to vary under the control of a fully trainable hidden Markov model.
Affiliation(s)
- L Yeganova
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.