1
Hellmuth M, Stadler PF. The Theory of Gene Family Histories. Methods Mol Biol 2024; 2802:1-32. [PMID: 38819554 DOI: 10.1007/978-1-0716-3838-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Most genes are part of larger families of evolutionarily related genes. The history of gene families typically involves duplications and losses of genes as well as horizontal transfers into other organisms. The reconstruction of detailed gene family histories, i.e., the precise dating of evolutionary events relative to the phylogenetic tree of the underlying species, has remained a challenging topic despite their importance as a basis for detailed investigations into adaptation and functional evolution of individual members of the gene family. The identification of orthologs, moreover, is a particularly important subproblem of the more general setting considered here. In the last few years, an extensive body of mathematical results has appeared that tightly links orthology, a formal notion of best matches among genes, and horizontal gene transfer. The purpose of this chapter is to broadly outline some of the key mathematical insights and to discuss their implications for practical applications. In particular, we focus on tree-free methods, i.e., methods to infer orthology or horizontal gene transfer as well as gene trees, species trees, and reconciliations between them without using a priori knowledge of the underlying trees or statistical models for the inference of phylogenetic trees. Instead, the initial step aims to extract binary relations among genes.
Affiliation(s)
- Marc Hellmuth
- Department of Mathematics, Faculty of Science, Stockholm University, Stockholm, Sweden
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, Leipzig University, Leipzig, Germany.
- Interdisciplinary Center for Bioinformatics, Leipzig University, Leipzig, Germany.
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
- Universidad Nacional de Colombia, Bogotá, Colombia.
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria.
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
- Santa Fe Institute, Santa Fe, NM, USA.
2
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. BIOLOGY 2023; 12:849. [PMID: 37372134 PMCID: PMC10295253 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. 
Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
Affiliation(s)
- Pedro Bernaola-Galván
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain; (P.B.-G.); (P.C.)
- Pedro Carpena
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain; (P.B.-G.); (P.C.)
- Cristina Gómez-Martín
- Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands;
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
- Jose L. Oliver
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
3
de la Fuente R, Díaz-Villanueva W, Arnau V, Moya A. Genomic Signature in Evolutionary Biology: A Review. BIOLOGY 2023; 12:322. [PMID: 36829597 PMCID: PMC9953303 DOI: 10.3390/biology12020322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/11/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
Affiliation(s)
- Rebeca de la Fuente
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Wladimiro Díaz-Villanueva
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Vicente Arnau
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Andrés Moya
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Foundation for the Promotion of Sanitary and Biomedical Research of the Valencian Community (FISABIO), 46020 Valencia, Spain
- CIBER in Epidemiology and Public Health (CIBEResp), 28029 Madrid, Spain
4
Kern L, Abdeen SK, Kolodziejczyk AA, Elinav E. Commensal inter-bacterial interactions shaping the microbiota. Curr Opin Microbiol 2021; 63:158-171. [PMID: 34365152 DOI: 10.1016/j.mib.2021.07.011] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 05/31/2021] [Revised: 07/15/2021] [Accepted: 07/16/2021] [Indexed: 12/14/2022]
Abstract
The gut microbiota, a complex ecosystem of microorganisms of different kingdoms, impacts host physiology and disease. Within this ecosystem, inter-bacterial interactions and their impacts on microbiota community structure and the eukaryotic host remain insufficiently explored. Microbiota-related inter-bacterial interactions range from symbiotic interactions, involving exchange of nutrients, enzymes, and genetic material; competition for nutrients and space, mediated by biophysical alterations and secretion of toxins and anti-microbials; to predation of overpopulating bacteria. Collectively, these understudied interactions hold important clues as to forces shaping microbiota diversity, niche formation, and responses to signals perceived from the host, incoming pathogens and the environment. In this review, we highlight the roles and mechanisms of selected inter-bacterial interactions in the microbiota, and their potential impacts on the host and pathogenic infection. We discuss challenges in mechanistically decoding these complex interactions, and prospects of harnessing them as future targets for rational microbiota modification in a variety of diseases.
Affiliation(s)
- Lara Kern
- Immunology Department, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Suhaib K Abdeen
- Immunology Department, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Eran Elinav
- Immunology Department, Weizmann Institute of Science, Rehovot, 7610001, Israel; Cancer-Microbiota Division Deutsches Krebsforschungszentrum (DKFZ), Neuenheimer Feld 280, 69120 Heidelberg, Germany.
5
Noroy C, Meyer DF. The super repertoire of type IV effectors in the pangenome of Ehrlichia spp. provides insights into host-specificity and pathogenesis. PLoS Comput Biol 2021; 17:e1008788. [PMID: 34252087 PMCID: PMC8274917 DOI: 10.1371/journal.pcbi.1008788] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/15/2021] [Accepted: 05/26/2021] [Indexed: 11/28/2022] Open
Abstract
The identification of bacterial effectors is essential to understand how obligatory intracellular bacteria such as Ehrlichia spp. manipulate the host cell for survival and replication. Infection of mammals–including humans–by the intracellular pathogenic bacteria Ehrlichia spp. depends largely on the injection of virulence proteins that hijack host cell processes. Several hypothetical virulence proteins have been identified in Ehrlichia spp., but only one so far has been experimentally shown to translocate into host cells via the type IV secretion system. However, the current challenge is to identify most of the type IV effectors (T4Es) to fully understand their role in Ehrlichia spp. virulence and host adaptation. Here, we predict the T4E repertoires of four sequenced Ehrlichia spp. and four other Anaplasmataceae as comparative models (pathogenic Anaplasma spp. and Wolbachia endosymbiont) using previously developed S4TE 2.0 software. This analysis identified 579 predicted T4Es (228 pT4Es for Ehrlichia spp. only). The effector repertoires of Ehrlichia spp. overlapped, thereby defining a conserved core effectome of 92 predicted effectors shared by all strains. In addition, 69 species-specific T4Es were predicted with non-canonical GC% mostly in gene sparse regions of the genomes and we observed a bias in pT4Es according to host-specificity. We also identified new protein domain combinations, suggesting novel effector functions. This work presenting the predicted effector collection of Ehrlichia spp. can serve as a guide for future functional characterisation of effectors and design of alternative control strategies against these bacteria.
A fundamental step for the survival and replication of intravacuolar bacterial pathogens is the establishment of a replicative niche inside host cells by the secretion of bacterial effector proteins in the cytoplasm of the infected cells. These effectors manipulate host signaling pathways, thus allowing the bacteria to escape the host degradative pathway and take up nutrients required for intracellular replication. In this study, we used S4TE 2.0 software for high-throughput computational prediction of bacterial type IV effectors in zoonotic bacteria of the Anaplasmataceae family. The analysis of protein architecture of effectors helped us to identify the cellular pathways targeted during the infection process. The demonstration that effectors are modular components with a broad variety of protein architectures nicely explains their pleiotropic mode of action and sheds light on their function. We showed that bacterial adaptation to a given host during evolution requires a minimal repertoire of candidate effectors, although further experimental determination is needed. T4Es are of increasing interest for basic research, including comprehension of hijacked cellular pathways, manipulated innate immunity, and application for therapeutics. Indeed, pathogenomics-driven studies, especially on genetically intractable intracellular bacteria such as Anaplasmataceae, now have a substantial impact on the development of host-targeted antimicrobials, as an alternative to antibiotics.
Affiliation(s)
- Christophe Noroy
- CIRAD, UMR ASTRE, Petit-Bourg, Guadeloupe, France
- ASTRE, CIRAD, INRA, Univ Montpellier, Montpellier, France
- Université des Antilles, Fouillole, Pointe-à-Pitre, Guadeloupe, France
- Damien F. Meyer
- CIRAD, UMR ASTRE, Petit-Bourg, Guadeloupe, France
- ASTRE, CIRAD, INRA, Univ Montpellier, Montpellier, France
6
Indirect identification of horizontal gene transfer. J Math Biol 2021; 83:10. [PMID: 34218334 PMCID: PMC8254804 DOI: 10.1007/s00285-021-01631-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/16/2020] [Revised: 04/06/2021] [Accepted: 06/13/2021] [Indexed: 12/04/2022]
Abstract
Several implicit methods to infer horizontal gene transfer (HGT) focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of a graph, the later-divergence-time (LDT) graph, whose vertices correspond to genes colored by their species. We investigate these graphs in the setting of relaxed scenarios, i.e., evolutionary scenarios that encompass all commonly used variants of duplication-transfer-loss scenarios in the literature. We characterize LDT graphs as a subclass of properly vertex-colored cographs, and provide a polynomial-time recognition algorithm as well as an algorithm to construct a relaxed scenario that explains a given LDT. An edge in an LDT graph implies that the two corresponding genes are separated by at least one HGT event. The converse is not true, however. We show that the complete xenology relation is described by an rs-Fitch graph, i.e., a complete multipartite graph satisfying constraints on the vertex coloring. This class of vertex-colored graphs is also recognizable in polynomial time. We finally address the question “how much information about all HGT events is contained in LDT graphs” with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. In particular, we show that a simple greedy graph editing scheme can be used to efficiently detect HGT events that are implicitly contained in LDT graphs.
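The structural claim in this abstract (LDT graphs form a subclass of properly vertex-colored cographs) suggests a quick necessary-condition check. The following is a minimal Python sketch, not the authors' recognition algorithm: it tests only proper coloring and the cograph property, using the standard characterisation that every induced subgraph on at least two vertices is either disconnected or has a disconnected complement. All function names and the toy graphs are hypothetical, and passing both checks does not establish that a graph is an LDT graph, because the further conditions from the paper are not tested here.

```python
def build_adjacency(vertices, edges):
    """Adjacency sets for an undirected graph (hypothetical helper)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def components(vertices, adj):
    """Connected components of the subgraph induced by `vertices`."""
    remaining = set(vertices)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            # Intersection builds a new set, so mutating `remaining` is safe.
            for w in adj[v] & remaining:
                remaining.discard(w)
                comp.add(w)
                stack.append(w)
        comps.append(comp)
    return comps


def is_cograph(vertices, adj):
    """True if the induced subgraph contains no induced path on 4 vertices."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True
    comps = components(vertices, adj)
    if len(comps) == 1:
        # Connected: a cograph must then have a disconnected complement.
        co_adj = {v: vertices - adj[v] - {v} for v in vertices}
        comps = components(vertices, co_adj)
        if len(comps) == 1:
            return False
    return all(is_cograph(c, adj) for c in comps)


def is_properly_colored(adj, color):
    """True if no edge joins two genes of the same species (color)."""
    return all(color[u] != color[v] for u in adj for v in adj[u])


# Toy graphs: a path on three genes (a cograph) and on four genes (not one).
p3 = build_adjacency("abc", [("a", "b"), ("b", "c")])
p4 = build_adjacency("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
species = {"a": "X", "b": "Y", "c": "X", "d": "Y"}
```

On the toy data, the three-gene path passes both checks, while the four-gene path is properly colored but fails the cograph test and so cannot be an LDT graph.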
7
Tay AP, Hosking B, Hosking C, Bauer DC, Wilson LO. INSIDER: alignment-free detection of foreign DNA sequences. Comput Struct Biotechnol J 2021; 19:3810-3816. [PMID: 34285780 PMCID: PMC8273350 DOI: 10.1016/j.csbj.2021.06.045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/15/2021] [Revised: 06/28/2021] [Accepted: 06/28/2021] [Indexed: 11/21/2022] Open
Abstract
External DNA sequences can be inserted into an organism's genome either through natural processes such as gene transfer, or through targeted genome engineering strategies. Being able to robustly identify such foreign DNA is a crucial capability for health and biosecurity applications, such as anti-microbial resistance (AMR) detection or monitoring gene drives. This capability does not exist for poorly characterised host genomes or with limited information about the integrated sequence. To address this, we developed the INserted Sequence Information DEtectoR (INSIDER). INSIDER analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures. We demonstrate the power of INSIDER to separate integrated DNA sequences from normal genomic sequences on a synthetic dataset simulating the insertion of a CRISPR-Cas gene drive into wild-type yeast. As a proof-of-concept, we use INSIDER to detect the exact AMR plasmid in whole genome sequencing data from a Citrobacter freundii patient isolate. INSIDER streamlines the process of identifying integrated DNA in poorly characterised wild species or when the insert is of unknown origin, thus enhancing the monitoring of emerging biosecurity threats.
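The idea of flagging segments by a shift in k-mer signature can be illustrated with a small Python sketch. This is not INSIDER's algorithm or statistical test, which the abstract does not specify; it is a simplified stand-in that compares the k-mer frequency profile of each fixed-size window with the genome-wide profile using a Manhattan distance. The function names, the window and k-mer sizes, and the synthetic sequence are all assumptions made for illustration.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequencies of the overlapping k-mers in `seq`."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Manhattan distance between two k-mer frequency profiles."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def window_shifts(genome, k=2, window=50, step=50):
    """(start, distance) pairs: each window's profile vs. the whole genome."""
    background = kmer_profile(genome, k)
    return [
        (start, profile_distance(kmer_profile(genome[start:start + window], k), background))
        for start in range(0, len(genome) - window + 1, step)
    ]


# Synthetic example: an AT-rich "host" with a GC-rich insert at position 200.
toy_genome = "AT" * 100 + "GC" * 25 + "AT" * 100
scores = window_shifts(toy_genome)
most_shifted_start = max(scores, key=lambda item: item[1])[0]
```

In this toy case the window covering the insert has by far the largest distance to the background. Real data would need a significance threshold and a k large enough to be informative, neither of which is modelled here.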
Affiliation(s)
- Aidan P. Tay
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Brendan Hosking
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Cameron Hosking
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Denis C. Bauer
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Department of Biomedical Sciences, Macquarie University, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Laurence O.W. Wilson
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
8
Bize A, Midoux C, Mariadassou M, Schbath S, Forterre P, Da Cunha V. Exploring short k-mer profiles in cells and mobile elements from Archaea highlights the major influence of both the ecological niche and evolutionary history. BMC Genomics 2021; 22:186. [PMID: 33726663 PMCID: PMC7962313 DOI: 10.1186/s12864-021-07471-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2020] [Accepted: 02/24/2021] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND: K-mer-based methods have greatly advanced in recent years, largely driven by the realization of their biological significance and by the advent of next-generation sequencing. Their speed and their independence from the annotation process are major advantages. Their utility in the study of the mobilome has recently emerged and they seem a priori adapted to the patchy gene distribution and the lack of universal marker genes of viruses and plasmids. To provide a framework for the interpretation of results from k-mer based methods applied to archaea or their mobilome, we analyzed the 5-mer DNA profiles of close to 600 archaeal cells, viruses and plasmids. Archaea is one of the three domains of life. Archaea seem enriched in extremophiles and are associated with a high diversity of viral and plasmid families, many of which are specific to this domain. We explored the dataset structure by multivariate and statistical analyses, seeking to identify the underlying factors.
RESULTS: For cells, the 5-mer profiles were inconsistent with the phylogeny of archaea. At a finer taxonomic level, the influence of the taxonomy and the environmental constraints on 5-mer profiles was very strong. These two factors were interdependent to a significant extent, and the respective weights of their contributions varied according to the clade. A convergent adaptation was observed for the class Halobacteria, for which a strong 5-mer signature was identified. For mobile elements, coevolution with the host had a clear influence on their 5-mer profile. This enabled us to identify one previously known and one new case of recent host transfer based on the atypical composition of the mobile elements involved. Beyond the effect of coevolution, extrachromosomal elements strikingly retain the specific imprint of their own viral or plasmid taxonomic family in their 5-mer profile.
CONCLUSION: This specific imprint confirms that the evolution of extrachromosomal elements is driven by multiple parameters and is not restricted to host adaptation. In addition, we detected only recent host transfer events, suggesting the fast evolution of short k-mer profiles. This calls for caution when using k-mers for host prediction, metagenomic binning or phylogenetic reconstruction.
Affiliation(s)
- Ariane Bize
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France.
- Cédric Midoux
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Mahendra Mariadassou
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Sophie Schbath
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Patrick Forterre
- Institut Pasteur, Unité de Virologie des Archées, Département de Microbiologie, 25 Rue du Docteur Roux, 75015, Paris, France
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Violette Da Cunha
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
9
Goussarov G, Cleenwerck I, Mysara M, Leys N, Monsieurs P, Tahon G, Carlier A, Vandamme P, Van Houdt R. PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing. Bioinformatics 2020; 36:2337-2344. [PMID: 31899493 PMCID: PMC7178395 DOI: 10.1093/bioinformatics/btz964] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/19/2019] [Revised: 11/21/2019] [Accepted: 12/30/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods do offer a faster alternative but at the expense of accuracy. Here, we aim to address this shortcoming by providing a software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.
RESULTS: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses.
AVAILABILITY AND IMPLEMENTATION: The method introduced here was implemented, together with other existing methods, in a dependency-free software written in C, GenDisCal, available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gleb Goussarov
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- Ilse Cleenwerck
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- Mohamed Mysara
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Natalie Leys
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Pieter Monsieurs
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Guillaume Tahon
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- Aurélien Carlier
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- LIPM, Université de Toulouse, INRAE, CNRS, Castanet-Tolosan, France
- Peter Vandamme
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- Rob Van Houdt
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
10
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022] Open
Abstract
Background: Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem.
Results: Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes.
Conclusions: Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
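The contrast between a correlation-based score (TETRA) and a distance-based score (TZMD) can be made concrete with a small Python sketch. It is not the authors' implementation and does not compute tetranucleotide z-values; the two short vectors below are hypothetical stand-ins for z-value vectors, chosen so that one is an exact multiple of the other. The point it illustrates is a general mathematical one: a Pearson correlation is unchanged by rescaling, so it can report perfect agreement between vectors whose entries differ, whereas a Manhattan distance is zero only when the vectors are identical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical z-value vectors: z_b is exactly twice z_a.
z_a = [1.0, -2.0, 0.5, 3.0]
z_b = [2.0, -4.0, 1.0, 6.0]

correlation = pearson(z_a, z_b)   # equals 1 up to floating-point rounding
distance = manhattan(z_a, z_b)    # 1.0 + 2.0 + 0.5 + 3.0 = 6.5
```

Whether this scaling effect is the main reason for the resolution gain reported in the abstract is not stated there, so the sketch should be read only as an illustration of how the two measures can disagree.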
Affiliation(s)
- Yizhuang Zhou
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China
- Wenting Zhang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Huixian Wu
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Kai Huang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Junfei Jin
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
11
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
12
Liu L, Anderson C, Pearl D, Edwards SV. Modern Phylogenomics: Building Phylogenetic Trees Using the Multispecies Coalescent Model. Methods Mol Biol 2019; 1910:211-239. [PMID: 31278666 DOI: 10.1007/978-1-4939-9074-0_7] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 06/09/2023]
Abstract
The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called "multispecies network coalescent" models, in which gene flow is allowed among branches of the tree, whereas MSC methods assume there is no gene flow between diverging species. Early implementations of the MSC, such as "parsimony" or "democratic vote" approaches to combining information from multiple gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a single "supergene," were quickly shown to be inconsistent in some regions of tree space, in so far as they converged on the incorrect species tree as more gene trees and sequence data were accumulated. The anomaly zone, a region of tree space in which the most frequent gene tree is different from the species tree, is one such region where many so-called "coalescent" methods are inconsistent. Second-generation implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first estimated and then combined to estimate an overarching species tree, are currently popular in part because they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can sometimes provide inappropriate measures of tree support or apportion error and signal in the data inappropriately. MP-EST in particular employs a likelihood model which can be conveniently manipulated to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene trees on each species tree in a likelihood ratio test. 
Such tests provide a useful alternative to the multilocus bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of models effectively using information from multiple loci to build phylogenetic trees.
Affiliation(s)
- Liang Liu
- Department of Statistics, University of Georgia, Athens, GA, USA
- Dennis Pearl
- Department of Statistics, Pennsylvania State University, University Park, PA, USA
- Scott V Edwards
- Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA.
13
Danchin A, Ouzounis C, Tokuyasu T, Zucker JD. No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects. Microb Biotechnol 2018; 11:588-605. [PMID: 29806194 PMCID: PMC6011933 DOI: 10.1111/1751-7915.13284] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022]
Abstract
Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from 'the sequence tells the structure tells the function' fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader.
Affiliation(s)
- Antoine Danchin
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, Hong Kong University, 21 Sassoon Road, Pokfulam, Hong Kong
- Christos Ouzounis
- Biological Computation and Process Laboratory, Centre for Research and Technology Hellas, Chemical Process and Energy Resources Institute, Thessalonica, 57001, Greece
- Taku Tokuyasu
- Shenzhen Institutes of Advanced Technology, Institute of Synthetic Biology, Shenzhen University Town, 1068 Xueyuan Avenue, Shenzhen, China
- Jean-Daniel Zucker
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
14
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
- Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
- Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
15
Clasen FJ, Pierneef RE, Slippers B, Reva O. EuGI: a novel resource for studying genomic islands to facilitate horizontal gene transfer detection in eukaryotes. BMC Genomics 2018; 19:323. [PMID: 29724163 PMCID: PMC5934851 DOI: 10.1186/s12864-018-4724-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/23/2017] [Accepted: 04/25/2018] [Indexed: 11/17/2022]
Abstract
Background: Genomic islands (GIs) are inserts of foreign DNA that have potentially arisen through horizontal gene transfer (HGT). There is evidence that GIs can contribute significantly to the evolution of prokaryotes. The acquisition of GIs through HGT in eukaryotes has, however, been largely unexplored. In this study, the previously developed GI prediction tool, SeqWord Gene Island Sniffer (SWGIS), is modified to predict GIs in eukaryotic chromosomes. Artificial simulations are used to estimate ratios of predicting false positive and false negative GIs by inserting GIs into different test chromosomes and performing the SWGIS v2.0 algorithm. Using SWGIS v2.0, GIs are then identified in 36 fungal, 22 protozoan and 8 invertebrate genomes. Results: SWGIS v2.0 predicts GIs in large eukaryotic chromosomes based on the atypical nucleotide composition of these regions. Averages for predicting false negative and false positive GIs were 20.1% and 11.01% respectively. A total of 10,550 GIs were identified in 66 eukaryotic species with 5299 of these GIs coding for at least one functional protein. The EuGI web-resource, freely accessible at http://eugi.bi.up.ac.za, was developed that allows browsing the database created from identified GIs and genes within GIs through an interactive and visual interface. Conclusions: SWGIS v2.0 along with the EuGI database, which houses GIs identified in 66 different eukaryotic species, and the EuGI web-resource, provide the first comprehensive resource for studying HGT in eukaryotes.
Affiliation(s)
- Frederick Johannes Clasen
- Centre for Bioinformatics and Computational Biology; Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria 0002, Private Bag X20, Hatfield, 0028, South Africa.
- Forestry and Agricultural Biotechnology Institute; Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, 0002, South Africa.
- Rian Ewald Pierneef
- Centre for Bioinformatics and Computational Biology; Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria 0002, Private Bag X20, Hatfield, 0028, South Africa
- Bernard Slippers
- Forestry and Agricultural Biotechnology Institute; Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, 0002, South Africa
- Oleg Reva
- Centre for Bioinformatics and Computational Biology; Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria 0002, Private Bag X20, Hatfield, 0028, South Africa
16
Tang K, Lu YY, Sun F. Background Adjusted Alignment-Free Dissimilarity Measures Improve the Detection of Horizontal Gene Transfer. Front Microbiol 2018; 9:711. [PMID: 29713314 PMCID: PMC5911508 DOI: 10.3389/fmicb.2018.00711] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 01/05/2018] [Accepted: 03/27/2018] [Indexed: 11/20/2022]
Abstract
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures to detect HGT. By testing on simulated bacterial sequences and real data sets with known horizontal transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and d2* that take into account the background Markov sequences can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors such as evolutionary distance between host and donor sequences, size of sliding window, and host genome composition on the performances of alignment-free methods to detect HGT. Our study showed that alignment-free methods can predict HGT accurately when host and donor genomes are in different order levels. Among all methods, CVTree with word length of 3, d2* with word length 3, Markov order 1 and d2* with word length 4, Markov order 1 outperform others in terms of their highest F1-score and their robustness under the influence of different factors.
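As a rough illustration of the simplest measure this abstract mentions (Manhattan distance between tetranucleotide frequency profiles, evaluated over a sliding window), the following sketch scores each window of a sequence against the whole-sequence profile. It is not the authors' implementation and does not cover the background-adjusted measures (CVTree, d2*) the study recommends; the function names, window size and step are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers; k-mers with non-ACGT letters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def window_scores(genome, window=5000, step=1000, k=4):
    """Distance of each sliding window's k-mer profile from the whole-sequence profile.

    Returns a list of (window_start, distance); unusually high distances mark
    compositionally atypical regions, which are only candidates for HGT.
    """
    background = kmer_freqs(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        profile = kmer_freqs(genome[start:start + window], k)
        scores.append((start, manhattan(profile, background)))
    return scores
```

On real genomes a threshold on these scores would have to be calibrated (the abstract reports that performance depends on window size, host composition and host-donor distance), so the sketch deliberately stops at producing the raw per-window distances.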
Affiliation(s)
- Kujin Tang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Yang Young Lu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
17
Horizontal acquisition of a hypoxia-responsive molybdenum cofactor biosynthesis pathway contributed to Mycobacterium tuberculosis pathoadaptation. PLoS Pathog 2017; 13:e1006752. [PMID: 29176894 PMCID: PMC5720804 DOI: 10.1371/journal.ppat.1006752] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 07/05/2017] [Revised: 12/07/2017] [Accepted: 11/13/2017] [Indexed: 12/16/2022]
Abstract
The unique ability of the tuberculosis (TB) bacillus, Mycobacterium tuberculosis, to persist for long periods of time in lung hypoxic lesions chiefly contributes to the global burden of latent TB. We and others previously reported that the M. tuberculosis ancestor underwent massive episodes of horizontal gene transfer (HGT), mostly from environmental species. Here, we sought to explore whether such ancient HGT played a part in M. tuberculosis evolution towards pathogenicity. We were interested by a HGT-acquired M. tuberculosis-specific gene set, namely moaA1-D1, which is involved in the biosynthesis of the molybdenum cofactor. Horizontal acquisition of this gene set was striking because homologues of these moa genes are present all across the Mycobacterium genus, including in M. tuberculosis. Here, we discovered that, unlike their paralogues, the moaA1-D1 genes are strongly induced under hypoxia. In vitro, a M. tuberculosis moaA1-D1-null mutant has an impaired ability to respire nitrate, to enter dormancy and to survive in oxygen-limiting conditions. Conversely, heterologous expression of moaA1-D1 in the phylogenetically closest non-TB mycobacterium, Mycobacterium kansasii, which lacks these genes, improves its capacity to respire nitrate and grants it with a marked ability to survive oxygen depletion. In vivo, the M. tuberculosis moaA1-D1-null mutant shows impaired survival in hypoxic granulomas in C3HeB/FeJ mice, but not in normoxic lesions in C57BL/6 animals. Collectively, our results identify a novel pathway required for M. tuberculosis resistance to host-imposed stress, namely hypoxia, and provide evidence that ancient HGT bolstered M. tuberculosis evolution from an environmental species towards a pervasive human-adapted pathogen. Mycobacterium tuberculosis, the etiological agent of tuberculosis (TB), can persist for years and even decades in the lungs of its human host. Here we report that a unique M. tuberculosis gene cluster involved in the synthesis of the molybdenum cofactor, a cofactor for several oxidoreductases including the nitrate reductase, allows this major pathogen to respire nitrate and to persist in a dormant state under hypoxia, a stress condition encountered in lung TB lesions. Strikingly the M. tuberculosis ancestor, which most likely was an environmental harmless bacterium, acquired this gene cluster, together with its hypoxia-responsive transcriptional regulator, horizontally from neighboring bacteria. Our results uncover a key step in M. tuberculosis evolution towards pathogenicity.
18
Gatherer D. Genome Signatures, Self-Organizing Maps and Higher Order Phylogenies: A Parametric Analysis. Evol Bioinform Online 2017. [DOI: 10.1177/117693430700300001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
Abstract
Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in phylogenetic classification of unknown DNA sequences. However, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. In general, these results indicate that small, rapidly trained SOMs are generally as good as larger, longer trained ones for the analysis of genome signatures. These results may also be more generally applicable to the use of SOMs for other complex data sets, such as microarray data.
Affiliation(s)
- Derek Gatherer
- MRC Virology Unit, Institute of Virology. Church Street, Glasgow G11 5JR, UK
19
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 248] [Impact Index Per Article: 35.4] [Indexed: 01/30/2023]
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
20
Genomic Analysis of Calderihabitans maritimus KKC1, a Thermophilic, Hydrogenogenic, Carboxydotrophic Bacterium Isolated from Marine Sediment. Appl Environ Microbiol 2017; 83:AEM.00832-17. [PMID: 28526793 DOI: 10.1128/aem.00832-17] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 04/11/2017] [Accepted: 05/13/2017] [Indexed: 11/20/2022]
Abstract
Calderihabitans maritimus KKC1 is a thermophilic, hydrogenogenic carboxydotroph isolated from a submerged marine caldera. Here, we describe the de novo sequencing and feature analysis of the C. maritimus KKC1 genome. Genome-based phylogenetic analysis confirmed that C. maritimus KKC1 was most closely related to the genus Moorella, which includes well-studied acetogenic members. Comparative genomic analysis revealed that, like Moorella, C. maritimus KKC1 retained both the CO2-reducing Wood-Ljungdahl pathway and energy-converting hydrogenase-based module activated by reduced ferredoxin, but it lacked the HydABC and NfnAB electron-bifurcating enzymes and pyruvate:ferredoxin oxidoreductase required for ferredoxin reduction for acetogenic growth. Furthermore, C. maritimus KKC1 harbored six genes encoding CooS, a catalytic subunit of the anaerobic CO dehydrogenase that can reduce ferredoxin via CO oxidation, whereas Moorella possessed only two CooS genes. Our analysis revealed that three cooS genes formed known gene clusters in other microorganisms, i.e., cooS-acetyl coenzyme A (acetyl-CoA) synthase (which contained a frameshift mutation), cooS-energy-converting hydrogenase, and cooF-cooS-FAD-NAD oxidoreductase, while the other three had novel genomic contexts. Sequence composition analysis indicated that these cooS genes likely evolved from a common ancestor. Collectively, these data suggest that C. maritimus KKC1 may be highly dependent on CO as a low-potential electron donor to directly reduce ferredoxin and may be more suited to carboxydotrophic growth compared to the acetogenic growth observed in Moorella, which show adaptation at a thermodynamic limit. IMPORTANCE: Calderihabitans maritimus KKC1 and members of the genus Moorella are phylogenetically related but physiologically distinct. The former is a hydrogenogenic carboxydotroph that can grow on carbon monoxide (CO) with H2 production, whereas the latter include acetogenic bacteria that grow on H2 plus CO2 with acetate production. Both species may require reduced ferredoxin as an actual "energy equivalent," but ferredoxin is a low-potential electron carrier and requires a high-energy substrate as an electron donor for reduction. Comparative genomic analysis revealed that C. maritimus KKC1 lacked specific electron-bifurcating enzymes and possessed six CO dehydrogenases, unlike Moorella species. This suggests that C. maritimus KKC1 may be more dependent on CO, a strong electron donor that can directly reduce ferredoxin via CO dehydrogenase, and may exhibit a survival strategy different from that of acetogenic Moorella, which solves the energetic barrier associated with endergonic reduction of ferredoxin with hydrogen.
21
Barros-Carvalho GA, Van Sluys MA, Lopes FM. An Efficient Approach to Explore and Discriminate Anomalous Regions in Bacterial Genomes Based on Maximum Entropy. J Comput Biol 2017; 24:1125-1133. [PMID: 28570142 DOI: 10.1089/cmb.2017.0042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Recently, there has been an increase in the number of whole bacterial genomes sequenced, mainly due to the advancing of next-generation sequencing technologies. In face of this, there is a need to provide new analytical alternatives that can follow this advance. Given our current knowledge about the genomic plasticity of bacteria and that those genomic regions can uncover important features about this microorganism, our goal was to develop a fast methodology based on maximum entropy (ME) to guide the researcher to regions that could be prioritized during the analysis. This methodology was compared with other available methods. In addition, ME was applied to eight different bacterial genera. The methodology consists of two main steps: processing the nucleotide sequence and ME calculation. We applied ME to Xanthomonas axonopodis pv. citri 306 (XAC) and Xanthomonas campestris pv. campestris ATCC 33913 (XCC), both of which have their anomalous regions well documented. We then compared our results against those from Alien Hunter, HGT-DB, Islander, IslandPath, and SIGI-HMM. ME was shown to be superior in terms of efficiency and analysis duration. Besides, ME only needs the genome sequence in FASTA format as input. The proposed strategy based on ME is able to help in bacterial genome exploration. This is a simple and fast strategy for individual genomes in comparison with other available methods, without relying on previous annotation and alignments. This methodology can also be a new option in the early stages of analysis of newly sequenced bacterial genomes.
Affiliation(s)
- Gesiele Almeida Barros-Carvalho
- Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
- GaTE Lab, Department of Botany, Institute of Bioscience, University of São Paulo, São Paulo, Brazil
- Marie-Anne Van Sluys
- GaTE Lab, Department of Botany, Institute of Bioscience, University of São Paulo, São Paulo, Brazil
22
Cuecas A, Kanoksilapatham W, Gonzalez JM. Evidence of horizontal gene transfer by transposase gene analyses in Fervidobacterium species. PLoS One 2017; 12:e0173961. [PMID: 28426805 PMCID: PMC5398504 DOI: 10.1371/journal.pone.0173961] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 11/17/2016] [Accepted: 03/01/2017] [Indexed: 11/25/2022]
Abstract
Horizontal Gene Transfer (HGT) plays an important role in the physiology and evolution of microorganisms above all thermophilic prokaryotes. Some members of the Phylum Thermotogae (i.e., Thermotoga spp.) have been reported to present genomes constituted by a mosaic of genes from a variety of origins. This study presents a novel approach to search on the potential plasticity of Fervidobacterium genomes using putative transposase-encoding genes as the target of analysis. Transposases are key proteins involved in genomic DNA rearrangements. A comprehensive comparative analysis, including phylogeny, non-metric multidimensional scaling analysis of tetranucleotide frequencies, repetitive flanking sequences and divergence estimates, was performed on the transposase genes detected in four Fervidobacterium genomes: F. nodosum, F. pennivorans, F. islandicum and a new isolate (Fervidobacterium sp. FC2004). Transposase sequences were classified in different groups by their degree of similarity. The different methods used in this study pointed that over half of the transposase genes represented putative HGT events with closest relative sequences within the phylum Firmicutes, being Caldicellulosiruptor the genus showing highest gene sequence proximity. These results confirmed a direct evolutionary relationship through HGT between specific Fervidobacterium species and thermophilic Firmicutes leading to potential gene sequence and functionality sharing to thrive under similar environmental conditions. Transposase-encoding genes represent suitable targets to approach the plasticity and potential mosaicism of bacterial genomes.
Affiliation(s)
- Alba Cuecas
- Institute of Natural Resources and Agrobiology, Spanish Council for Research, IRNAS-CSIC, Avda. Rena Mercedes 10, Sevilla, Spain
- Wirojne Kanoksilapatham
- Department of Microbiology, Faculty of Science, Silpakorn University, Nakhon Pathom, Thailand
- Juan M. Gonzalez
- Institute of Natural Resources and Agrobiology, Spanish Council for Research, IRNAS-CSIC, Avda. Rena Mercedes 10, Sevilla, Spain
23
Jain S, Panda A, Colson P, Raoult D, Pontarotti P. MimiLook: A Phylogenetic Workflow for Detection of Gene Acquisition in Major Orthologous Groups of Megavirales. Viruses 2017; 9:v9040072. [PMID: 28387730 PMCID: PMC5408678 DOI: 10.3390/v9040072] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/28/2017] [Revised: 04/03/2017] [Accepted: 04/03/2017] [Indexed: 12/20/2022]
Abstract
With the inclusion of new members, understanding about evolutionary mechanisms and processes by which members of the proposed order, Megavirales, have evolved has become a key area of interest. The central role of gene acquisition has been shown in previous studies. However, the major drawback in gene acquisition studies is the focus on few MV families or putative families with large variation in their genetic structure. Thus, here we have tried to develop a methodology by which we can detect horizontal gene transfers (HGTs), taking into consideration orthologous groups of distantly related Megavirale families. Here, we report an automated workflow MimiLook, prepared as a Perl command line program, that deduces orthologous groups (OGs) from ORFomes of Megavirales and constructs phylogenetic trees by performing alignment generation, alignment editing and protein-protein BLAST (BLASTP) searching across the National Center for Biotechnology Information (NCBI) non-redundant (nr) protein sequence database. Finally, this tool detects statistically validated events of gene acquisitions with the help of the T-REX algorithm by comparing individual gene tree with NCBI species tree. In between the steps, the workflow decides about handling paralogs, filtering outputs, identifying Megavirale specific OGs, detection of HGTs, along with retrieval of information about those OGs that are monophyletic with organisms from cellular domains of life. By implementing MimiLook, we noticed that nine percent of Megavirale gene families (i.e., OGs) have been acquired by HGT, 80% of OGs were Megavirale specific, eight percent were found to be sharing common ancestry with members of cellular domains (Eukaryote, Bacteria, Archaea, Phages or other viruses) and three percent were ambivalent. The results are briefly discussed to emphasize methodology. Also, MimiLook is relevant for detecting evolutionary scenarios in other targeted phyla with user defined modifications. It can be accessed at the following link: 10.6084/m9.figshare.4653622.
Affiliation(s)
- Sourabh Jain
- Aix-Marseille Université, Ecole Centrale de Marseille, I2M UMR 7373, CNRS équipe Evolution Biologique et Modélisation, 13284 Marseille, France.
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), UM63 CNRS 7278 INSERM U1095, IRD 198, Faculté de Médecine, 13284 Marseille, France.
- Arup Panda
- Aix-Marseille Université, Ecole Centrale de Marseille, I2M UMR 7373, CNRS équipe Evolution Biologique et Modélisation, 13284 Marseille, France.
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), UM63 CNRS 7278 INSERM U1095, IRD 198, Faculté de Médecine, 13284 Marseille, France.
- Philippe Colson
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), UM63 CNRS 7278 INSERM U1095, IRD 198, Faculté de Médecine, 13284 Marseille, France.
- IHU Méditerranée Infection, Assistance Publique-Hôpitaux de Marseille, Centre Hospitalo-universitaire Timone, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, 13385 Marseille, France.
- Didier Raoult
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), UM63 CNRS 7278 INSERM U1095, IRD 198, Faculté de Médecine, 13284 Marseille, France.
- IHU Méditerranée Infection, Assistance Publique-Hôpitaux de Marseille, Centre Hospitalo-universitaire Timone, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, 13385 Marseille, France.
- Pierre Pontarotti
- Aix-Marseille Université, Ecole Centrale de Marseille, I2M UMR 7373, CNRS équipe Evolution Biologique et Modélisation, 13284 Marseille, France.
|
24
|
Abstract
Most phylogenetic methods are model-based and depend on models of evolution designed to approximate the evolutionary processes. Several methods have been developed to identify suitable models of evolution for phylogenetic analysis of alignments of nucleotide or amino acid sequences and some of these methods are now firmly embedded in the phylogenetic protocol. However, in a disturbingly large number of cases, it appears that these models were used without acknowledgement of their inherent shortcomings. In this chapter, we discuss the problem of model selection and show how some of the inherent shortcomings may be identified and overcome.
Affiliation(s)
- Vivek Jayaswal
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
- Faisal M Ababneh
- Department of Mathematics & Statistics, Al-Hussein Bin Talal University, Ma'an, Jordan
- John Robinson
- School of Mathematics & Statistics, University of Sydney, Sydney, NSW, Australia
|
25
|
Maumus F, Blanc G. Study of Gene Trafficking between Acanthamoeba and Giant Viruses Suggests an Undiscovered Family of Amoeba-Infecting Viruses. Genome Biol Evol 2016; 8:3351-3363. [PMID: 27811174 PMCID: PMC5203793 DOI: 10.1093/gbe/evw260] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Accepted: 10/21/2016] [Indexed: 01/10/2023] Open
Abstract
The nucleocytoplasmic large DNA viruses (NCLDV) are a group of extremely complex double-stranded DNA viruses, which are major parasites of a variety of eukaryotes. Recent studies showed that certain unicellular eukaryotes contain fragments of NCLDV DNA integrated in their genome, when surprisingly many of these organisms were not previously shown to be infected by NCLDVs. These findings prompted us to search the genome of Acanthamoeba castellanii strain Neff (Neff), one of the most prolific hosts in the discovery of giant NCLDVs, for possible DNA inserts of viral origin. We report the identification of 267 markers of lateral gene transfer with viruses, approximately half of which are clustered in Neff genome regions of viral origins, transcriptionally inactive or exhibit nucleotide-composition signatures suggestive of a foreign origin. The integrated viral genes had diverse origins among relatives of viruses that infect Neff, including Mollivirus, Pandoravirus, Marseillevirus, Pithovirus, and Mimivirus. However, phylogenetic analysis suggests the existence of a yet-undiscovered family of amoeba-infecting NCLDV in addition to the five already characterized. The active transcription of some apparently anciently integrated virus-like genes suggests that some viral genes might have been domesticated during the amoeba evolution. These insights confirm that genomic insertion of NCLDV DNA is a common theme in eukaryotes. This gene flow contributed to fertilizing the eukaryotic gene repertoire and participated in the occurrence of orphan genes, a long standing issue in genomics. Search for viral inserts in eukaryotic genomes followed by environmental screening of the original viruses should be used to isolate radically new NCLDVs.
Affiliation(s)
- Guillaume Blanc
- Structural and Genomic Information Laboratory (IGS), Aix-Marseille Université, CNRS UMR (IMM FR 3479), Marseille, France
|
26
|
Abstract
Unraveling the drivers controlling the response and adaptation of biological communities to environmental change, especially anthropogenic activities, is a central but poorly understood issue in ecology and evolution. Comparative genomics studies suggest that lateral gene transfer (LGT) is a major force driving microbial genome evolution, but its role in the evolution of microbial communities remains elusive. To delineate the importance of LGT in mediating the response of a groundwater microbial community to heavy metal contamination, representative Rhodanobacter reference genomes were sequenced and compared to shotgun metagenome sequences. 16S rRNA gene-based amplicon sequence analysis indicated that Rhodanobacter populations were highly abundant in contaminated wells with low pHs and high levels of nitrate and heavy metals but remained rare in the uncontaminated wells. Sequence comparisons revealed that multiple geochemically important genes, including genes encoding Fe2+/Pb2+ permeases, most denitrification enzymes, and cytochrome c553, were native to Rhodanobacter and not subjected to LGT. In contrast, the Rhodanobacter pangenome contained a recombinational hot spot in which numerous metal resistance genes were subjected to LGT and/or duplication. In particular, Co2+/Zn2+/Cd2+ efflux and mercuric resistance operon genes appeared to be highly mobile within Rhodanobacter populations. Evidence of multiple duplications of a mercuric resistance operon common to most Rhodanobacter strains was also observed. Collectively, our analyses indicated the importance of LGT during the evolution of groundwater microbial communities in response to heavy metal contamination, and a conceptual model was developed to display such adaptive evolutionary processes for explaining the extreme dominance of Rhodanobacter populations in the contaminated groundwater microbiome. 
Lateral gene transfer (LGT), along with positive selection and gene duplication, are the three main mechanisms that drive adaptive evolution of microbial genomes and communities, but their relative importance is unclear. Some recent studies suggested that LGT is a major adaptive mechanism for microbial populations in response to changing environments, and hence, it could also be critical in shaping microbial community structure. However, direct evidence of LGT and its rates in extant natural microbial communities in response to changing environments is still lacking. Our results presented in this study provide explicit evidence that LGT played a crucial role in driving the evolution of a groundwater microbial community in response to extreme heavy metal contamination. It appears that acquisition of genes critical for survival, growth, and reproduction via LGT is the most rapid and effective way to enable microorganisms and associated microbial communities to quickly adapt to abrupt harsh environmental stresses.
|
27
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open
Abstract
Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. 
Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
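The Chaos Game Representation and the Euclidean comparison mentioned in this abstract can be illustrated with a minimal sketch. The corner assignment for A, C, G and T, the grid resolution `k`, and the function names `cgr_counts` and `euclidean` are assumptions made here for illustration; they are not taken from the paper, which also evaluates five further distances that are not shown.

```python
from math import sqrt

# Assumed corner convention (one common choice; not stated in the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_counts(seq, k=3):
    """Return a 2^k x 2^k grid counting the CGR points of `seq` per cell."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # the chaos game starts at the centre of the unit square
    for base in seq.upper():
        if base not in CORNERS:
            # Ambiguous symbol: restart from the centre and skip it.
            x, y = 0.5, 0.5
            continue
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


def euclidean(g1, g2):
    """Euclidean distance between two count grids of equal shape."""
    return sqrt(sum((a - b) ** 2
                    for r1, r2 in zip(g1, g2)
                    for a, b in zip(r1, r2)))
```

Computing `euclidean` for every pair of sequence fragments gives a distance matrix of the kind that the Molecular Distance Maps described above would then project into two or three dimensions; for fragments of different lengths the grids would normally be normalised first.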
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
- Lila Kari
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
- Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
- Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
|
28
|
Pierneef R, Cronje L, Bezuidt O, Reva ON. Pre_GI: a global map of ontological links between horizontally transferred genomic islands in bacterial and archaeal genomes. Database: The Journal of Biological Databases and Curation 2015. [PMID: 26200753 PMCID: PMC5630688 DOI: 10.1093/database/bav058] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022]
Abstract
The Predicted Genomic Islands database (Pre_GI) is a comprehensive repository of prokaryotic genomic islands (islands, GIs) freely accessible at http://pregi.bi.up.ac.za/index.php. Pre_GI, Version 2015, catalogues 26 744 islands identified in 2407 bacterial/archaeal chromosomes and plasmids. It provides an easy-to-use interface which allows users the ability to query against the database with a variety of fields, parameters and associations. Pre_GI is constructed to be a web-resource for the analysis of ontological roads between islands and cartographic analysis of the global fluxes of mobile genetic elements through bacterial and archaeal taxonomic borders. Comparison of newly identified islands against Pre_GI presents an alternative avenue to identify their ontology, origin and relative time of acquisition. Pre_GI aims to aid research on horizontal transfer events and materials through providing data and tools for holistic investigation of migration of genes through ecological niches and taxonomic boundaries.
Affiliation(s)
- Rian Pierneef
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, Gauteng 0002, South Africa
- Louis Cronje
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, Gauteng 0002, South Africa
- Oliver Bezuidt
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, Gauteng 0002, South Africa
- Oleg N Reva
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, Gauteng 0002, South Africa
|
29
|
Thomas D, Finan C, Newport MJ, Jones S. DNA entropy reveals a significant difference in complexity between housekeeping and tissue specific gene promoters. Comput Biol Chem 2015; 58:19-24. [PMID: 25988219 DOI: 10.1016/j.compbiolchem.2015.05.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/01/2014] [Revised: 05/01/2015] [Accepted: 05/01/2015] [Indexed: 10/23/2022]
Abstract
BACKGROUND The complexity of DNA can be quantified using estimates of entropy. Variation in DNA complexity is expected between the promoters of genes with different transcriptional mechanisms; namely housekeeping (HK) and tissue specific (TS). The former are transcribed constitutively to maintain general cellular functions, and the latter are transcribed in restricted tissue and cell types for specific molecular events. It is known that promoter features in the human genome are related to tissue specificity, but this has been difficult to quantify on a genomic scale. If entropy effectively quantifies DNA complexity, calculating the entropies of HK and TS gene promoters as profiles may reveal significant differences. RESULTS Entropy profiles were calculated for a total dataset of 12,003 human gene promoters and for 501 housekeeping (HK) and 587 tissue specific (TS) human gene promoters. The mean profiles show the TS promoters have a significantly lower entropy (p<2.2e-16) than HK gene promoters. The entropy distributions for the 3 datasets show that promoter entropies could be used to identify novel HK genes. CONCLUSION Functional features comprise DNA sequence patterns that are non-random and hence they have lower entropies. The lower entropy of TS gene promoters can be explained by a higher density of positive and negative regulatory elements, required for genes with complex spatial and temporal expression.
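As a rough illustration of what an entropy profile along a promoter might look like, the sketch below computes single-nucleotide Shannon entropy in sliding windows. The abstract does not say which entropy estimator, window size or step the authors used, so `shannon_entropy`, `entropy_profile`, `window` and `step` are hypothetical names and values chosen only for this example.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol frequencies in `seq`."""
    counts = Counter(seq.upper())
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_profile(seq, window=50, step=10):
    """Entropy of each sliding window; window and step are assumed values."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

A homopolymer window scores 0 bits and a window with equal frequencies of A, C, G and T scores 2 bits, so lower values mark less complex, more patterned sequence in the sense used in the abstract.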
Affiliation(s)
- David Thomas
- Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Chris Finan
- Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Melanie J Newport
- Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Susan Jones
- The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
|
30
|
Abstract
Horizontal or Lateral Gene Transfer (HGT or LGT) is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate the investigations of evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages. Computational identification of HGT events relies upon the investigation of sequence composition or evolutionary history of genes. Sequence composition-based ("parametric") methods search for deviations from the genomic average, whereas evolutionary history-based ("phylogenetic") approaches identify genes whose evolutionary history significantly differs from that of the host species. The evaluation and benchmarking of HGT inference methods typically rely upon simulated genomes, for which the true history is known. On real data, different methods tend to infer different HGT events, and as a result it can be difficult to ascertain all but simple and clear-cut HGT events.
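The idea behind sequence composition-based ("parametric") detection, flagging genes that deviate from the genomic average, can be sketched with GC content alone. This is a deliberately simplified stand-in rather than any specific published method: practical tools typically use richer signatures such as codon usage or oligonucleotide frequencies. The names `gc_content` and `atypical_genes` and the cutoff `z_cutoff=2.0` are assumptions for illustration, and the "genomic average" is approximated here by the mean over the supplied genes.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_genes(genes, z_cutoff=2.0):
    """Names of genes whose GC content lies more than `z_cutoff` standard
    deviations from the mean over all supplied genes.

    `genes` maps a gene name to its nucleotide sequence.
    """
    gc = {name: gc_content(s) for name, s in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:
        return []  # no variation, so nothing can stand out
    return sorted(n for n, v in gc.items() if abs(v - mu) / sd > z_cutoff)
```

A gene flagged this way is only a candidate: atypical composition can have causes other than transfer, which is one reason different methods tend to infer different HGT events on real data, as the abstract notes.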
Affiliation(s)
- Nives Škunca
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christophe Dessimoz
- University College London, London, United Kingdom
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
|
31
|
Almagro G, Viale AM, Montero M, Rahimpour M, Muñoz FJ, Baroja-Fernández E, Bahaji A, Zúñiga M, González-Candelas F, Pozueta-Romero J. Comparative genomic and phylogenetic analyses of Gammaproteobacterial glg genes traced the origin of the Escherichia coli glycogen glgBXCAP operon to the last common ancestor of the sister orders Enterobacteriales and Pasteurellales. PLoS One 2015; 10:e0115516. [PMID: 25607991 PMCID: PMC4301808 DOI: 10.1371/journal.pone.0115516] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 06/17/2014] [Accepted: 11/25/2014] [Indexed: 12/22/2022] Open
Abstract
Production of branched α-glucan, glycogen-like polymers is widely spread in the Bacteria domain. The glycogen pathway of synthesis and degradation has been fairly well characterized in the model enterobacterial species Escherichia coli (order Enterobacteriales, class Gammaproteobacteria), in which the cognate genes (branching enzyme glgB, debranching enzyme glgX, ADP-glucose pyrophosphorylase glgC, glycogen synthase glgA, and glycogen phosphorylase glgP) are clustered in a glgBXCAP operon arrangement. However, the evolutionary origin of this particular arrangement and of its constituent genes is unknown. Here, by using 265 complete gammaproteobacterial genomes we have carried out a comparative analysis of the presence, copy number and arrangement of glg genes in all lineages of the Gammaproteobacteria. These analyses revealed large variations in glg gene presence, copy number and arrangements among different gammaproteobacterial lineages. However, the glgBXCAP arrangement was remarkably conserved in all glg-possessing species of the orders Enterobacteriales and Pasteurellales (the E/P group). Subsequent phylogenetic analyses of glg genes present in the Gammaproteobacteria and in other main bacterial groups indicated that glg genes have undergone a complex evolutionary history in which horizontal gene transfer may have played an important role. These analyses also revealed that the E/P glgBXCAP genes (a) share a common evolutionary origin, (b) were vertically transmitted within the E/P group, and (c) are closely related to glg genes of some phylogenetically distant betaproteobacterial species. The overall data allowed tracing the origin of the E. coli glgBXCAP operon to the last common ancestor of the E/P group, and also to uncover a likely glgBXCAP transfer event from the E/P group to particular lineages of the Betaproteobacteria.
Affiliation(s)
- Goizeder Almagro
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Alejandro M. Viale
- Instituto de Biología Molecular y Celular de Rosario (IBR, CONICET), Departamento de Microbiología, Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario (UNR), Suipacha 531, 2000 Rosario, Argentina
- Manuel Montero
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Mehdi Rahimpour
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Francisco José Muñoz
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Edurne Baroja-Fernández
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Abdellatif Bahaji
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
- Manuel Zúñiga
- Dpt. Biotecnología de Alimentos, Instituto de Agroquímica y Tecnología de Alimentos, CSIC, Calle Agustín Escardino, 7, 46980 Paterna, Valencia, Spain
- Fernando González-Candelas
- Unidad Mixta Genómica y Salud, FISABIO-Salud Pública/Instituto Cavanilles de Biodiversidad y Biología Evolutiva, Universidad de Valencia, Calle Catedrático José Beltrán Martínez, 2, 46980 Paterna, Valencia, Spain
- Javier Pozueta-Romero
- Instituto de Agrobiotecnología (CSIC/UPNA/Gobierno de Navarra), Iruñako etorbidea 123, 31192 Mutiloabeti, Nafarroa, Spain
|
32
|
Metzler S, Kalinina OV. Detection of atypical genes in virus families using a one-class SVM. BMC Genomics 2014; 15:913. [PMID: 25336138 PMCID: PMC4210486 DOI: 10.1186/1471-2164-15-913] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 08/13/2014] [Accepted: 10/10/2014] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND The diversity of viruses, the absence of universally common genes in them, and their ability to act as carriers of genetic material make assessment of evolutionary paths of viral genes very difficult. One important factor contributing to this complexity is horizontal gene transfer. RESULTS We explore the possibility for the systematic identification of atypical genes within virus families, including viruses whose genome is not encoded by a double-stranded DNA. Our method is based on gene statistical features that differ in genes that were subject to recent horizontal gene transfer from those of the genome in which they are observed. We employ a one-class SVM approach to detect atypical genes within a virus family based on their statistical signatures and without explicit knowledge of the source species. The simplicity of the statistical features used makes the method applicable to various viruses irrespective of their genome size or type. CONCLUSIONS On simulated data, the method can robustly identify alien genes irrespective of the coding nucleic acid found in a virus. It also compares well to results obtained in related studies for double-stranded DNA viruses. Its value in practice is confirmed by the identification of isolated examples of horizontal gene transfer events that have already been described in the literature. A Python package implementing the method and the results for the analyzed virus families are available at http://svm-agp.bioinf.mpi-inf.mpg.de.
Affiliation(s)
- Olga V Kalinina
- Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany.
|
33
|
Kupczok A, Bollback JP. Motif depletion in bacteriophages infecting hosts with CRISPR systems. BMC Genomics 2014; 15:663. [PMID: 25103210 PMCID: PMC4246573 DOI: 10.1186/1471-2164-15-663] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 02/15/2014] [Accepted: 02/15/2014] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND CRISPR is a microbial immune system likely to be involved in host-parasite coevolution. It functions using target sequences encoded by the bacterial genome, which interfere with invading nucleic acids using a homology-dependent system. The system also requires protospacer associated motifs (PAMs), short motifs close to the target sequence that are required for interference in CRISPR types I and II. Here, we investigate whether PAMs are depleted in phage genomes due to selection pressure to escape recognition. RESULTS To this end, we analyzed two data sets. Phages infecting all bacterial hosts were analyzed first, followed by a detailed analysis of phages infecting the genus Streptococcus, where PAMs are best understood. We use two different measures of motif underrepresentation that control for codon bias and the frequency of submotifs. We compare phages infecting species with a particular CRISPR type to those infecting species without that type. Since only known PAMs were investigated, the analysis is restricted to CRISPR types I-C and I-E and in Streptococcus to types I-C and II. We found evidence for PAM depletion in Streptococcus phages infecting hosts with CRISPR type I-C, in Vibrio phages infecting hosts with CRISPR type I-E and in Streptococcus thermophilus phages infecting hosts with type II-A, known as CRISPR3. CONCLUSIONS The observed motif depletion in phages with hosts having CRISPR can be attributed to selection rather than to mutational bias, as mutational bias should affect the phages of all hosts. This observation implies that the CRISPR system has been efficient in the groups discussed here.
Affiliation(s)
- Anne Kupczok
- IST Austria, Am Campus 1, 3400 Klosterneuburg, Austria
- Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany
|
34
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097.
|
35
|
Peeters N, Carrère S, Anisimova M, Plener L, Cazalé AC, Genin S. Repertoire, unified nomenclature and evolution of the Type III effector gene set in the Ralstonia solanacearum species complex. BMC Genomics 2013; 14:859. [PMID: 24314259 PMCID: PMC3878972 DOI: 10.1186/1471-2164-14-859] [Citation(s) in RCA: 139] [Impact Index Per Article: 12.6] [Received: 07/24/2013] [Accepted: 11/29/2013] [Indexed: 12/21/2022] Open
Abstract
Background Ralstonia solanacearum is a soil-borne beta-proteobacterium that causes bacterial wilt disease in many food crops and is a major problem for agriculture in intertropical regions. R. solanacearum is a heterogeneous species, both phenotypically and genetically, and is considered as a species complex. Pathogenicity of R. solanacearum relies on the Type III secretion system that injects Type III effector (T3E) proteins into plant cells. T3E collectively perturb host cell processes and modulate plant immunity to enable bacterial infection. Results We provide the catalogue of T3E in the R. solanacearum species complex, as well as candidates in newly sequenced strains. 94 T3E orthologous groups were defined on phylogenetic bases and ordered using a uniform nomenclature. This curated T3E catalog is available on a public website and a bioinformatic pipeline has been designed to rapidly predict T3E genes in newly sequenced strains. Systematical analyses were performed to detect lateral T3E gene transfer events and identify T3E genes under positive selection. Our analyses also pinpoint the RipF translocon proteins as major discriminating determinants among the phylogenetic lineages. Conclusions Establishment of T3E repertoires in strains representatives of the R. solanacearum biodiversity allowed determining a set of 22 T3E present in all the strains but provided no clues on host specificity determinants. The definition of a standardized nomenclature and the optimization of predictive tools will pave the way to understanding how variation of these repertoires is correlated to the diversification of this species complex and how they contribute to the different strain pathotypes.
Affiliation(s)
- Nemo Peeters
- INRA, Laboratoire des Interactions Plantes-Microorganismes (LIPM), UMR441, F-31326 Castanet-Tolosan, France.
|
36
|
Taniguchi Y, Yamada Y, Maruyama O, Kuhara S, Ikeda D. The purity measure for genomic regions leads to horizontally transferred genes. J Bioinform Comput Biol 2013; 11:1343002. [PMID: 24372031 DOI: 10.1142/s0219720013430026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Sequence analysis is important for understanding a genome, and a number of approaches such as sequence alignments and hidden Markov models have been employed. In the field of text mining, the purity measure was developed to detect unusual regions of a string without any domain knowledge. That work reported that only RNAs and transposons have high purity values. In this work, the purity values of regions of various bacterial genome sequences are computed, and those regions are analyzed extensively. It is found that mobile elements and phages as well as RNAs and transposons have high purity values. Interestingly, they are all classified into a group of horizontally transferred genes. This means that the purity measure is useful for predicting horizontally transferred genes.
Affiliation(s)
- Yuta Taniguchi
- Department of Informatics, Kyushu University, Fukuoka, Japan
|
37
|
Jeanniard A, Dunigan DD, Gurnon JR, Agarkova IV, Kang M, Vitek J, Duncan G, McClung OW, Larsen M, Claverie JM, Van Etten JL, Blanc G. Towards defining the chloroviruses: a genomic journey through a genus of large DNA viruses. BMC Genomics 2013; 14:158. [PMID: 23497343 PMCID: PMC3602175 DOI: 10.1186/1471-2164-14-158] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Received: 10/25/2012] [Accepted: 02/22/2013] [Indexed: 11/29/2022]
Abstract
Background Giant viruses in the genus Chlorovirus (family Phycodnaviridae) infect eukaryotic green microalgae. The prototype member of the genus, Paramecium bursaria chlorella virus 1, was sequenced more than 15 years ago, and to date there are only 6 fully sequenced chloroviruses in public databases. Presented here are the draft genome sequences of 35 additional chloroviruses (287 – 348 Kb/319 – 381 predicted protein encoding genes) collected across the globe; they infect one of three different green algal species. These new data allowed us to analyze the genomic landscape of 41 chloroviruses, which revealed some remarkable features about these viruses. Results Genome colinearity, nucleotide conservation and phylogenetic affinity were limited to chloroviruses infecting the same host, confirming the validity of the three previously known subgenera. Clues for the existence of a fourth new subgenus indicate that the boundaries of chlorovirus diversity are not completely determined. Comparison of the chlorovirus phylogeny with that of the algal hosts indicates that chloroviruses have changed hosts in their evolutionary history. Reconstruction of the ancestral genome suggests that the last common chlorovirus ancestor had a slightly more diverse protein repertoire than modern chloroviruses. However, more than half of the defined chlorovirus gene families have a potential recent origin (after Chlorovirus divergence), among which a portion shows compositional evidence for horizontal gene transfer. Only a few of the putative acquired proteins had close homologs in databases, raising the question of the true donor organism(s). Phylogenomic analysis identified only seven proteins whose genes were potentially exchanged between the algal host and the chloroviruses. Conclusion The present evaluation suggests that the pattern of genomic evolution in chloroviruses differs from that described in the related Poxviridae and Mimiviridae. Our study shows that the fixation of algal host genes has been anecdotal in the evolutionary history of chloroviruses. We finally discuss the incongruence between compositional evidence of horizontal gene transfer and lack of close relative sequences in the databases, which suggests that the recently acquired genes originate from a still largely un-sequenced reservoir of genomes, possibly other unknown viruses that infect the same hosts.
Affiliation(s)
- Adrien Jeanniard
- Information Génomique & Structurale, IGS UMR7256, CNRS, Aix-Marseille Université, FR-13288, Marseille, France
|
38
|
Le PT, Ramulu HG, Guijarro L, Paganini J, Gouret P, Chabrol O, Raoult D, Pontarotti P. An automated approach for the identification of horizontal gene transfers from complete genomes reveals the rhizome of Rickettsiales. BMC Evol Biol 2012; 12:243. [PMID: 23234643 PMCID: PMC3575314 DOI: 10.1186/1471-2148-12-243] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 08/23/2012] [Accepted: 11/22/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Horizontal gene transfer (HGT) is considered to be a major force driving the evolutionary history of prokaryotes. HGT is widespread in prokaryotes, contributing to the genomic repertoire of prokaryotic organisms, and is particularly apparent in Rickettsiales genomes. Gene gains from both distantly and closely related organisms play crucial roles in the evolution of bacterial genomes. In this work, we focus on genes transferred from distantly related species into Rickettsiales species. RESULTS We developed an automated approach for the detection of HGT from other organisms (excluding alphaproteobacteria) into Rickettsiales genomes. Our systematic approach consisted of several specialized features including the application of a parsimony method for inferring phyletic patterns followed by blast filter, automated phylogenetic reconstruction and the application of patterns for HGT detection. We identified 42 instances of HGT in 31 complete Rickettsiales genomes, of which 38 were previously unidentified instances of HGT from Anaplasma, Wolbachia, Candidatus Pelagibacter ubique and Rickettsia genomes. Additionally, putative cases with no phylogenetic support were assigned gene ontology terms. Overall, these transfers could be characterized as "rhizome-like". CONCLUSIONS Our analysis provides a comprehensive, systematic approach for the automated detection of HGTs from several complete proteome sequences that can be applied to detect instances of HGT within other genomes of interest.
Affiliation(s)
- Phuong Thi Le
- Evolutionary biology and modeling, LATP UMR-CNRS 7353, Aix-Marseille University, 13331, Marseille, France
|
39
|
Liu L, Chen X, Skogerbø G, Zhang P, Chen R, He S, Huang DW. The human microbiome: A hot spot of microbial horizontal gene transfer. Genomics 2012; 100:265-70. [DOI: 10.1016/j.ygeno.2012.07.012] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Received: 01/30/2012] [Revised: 07/06/2012] [Accepted: 07/16/2012] [Indexed: 12/19/2022]
|
40
|
Saini V, Raghuvanshi S, Khurana JP, Ahmed N, Hasnain SE, Tyagi AK, Tyagi AK. Massive gene acquisitions in Mycobacterium indicus pranii provide a perspective on mycobacterial evolution. Nucleic Acids Res 2012; 40:10832-50. [PMID: 22965120 PMCID: PMC3505973 DOI: 10.1093/nar/gks793] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022]
Abstract
Understanding the evolutionary and genomic mechanisms responsible for turning the soil-derived saprophytic mycobacteria into lethal intracellular pathogens is a critical step towards the development of strategies for the control of mycobacterial diseases. In this context, Mycobacterium indicus pranii (MIP) is of specific interest because of its unique immunological and evolutionary significance. Evolutionarily, it is the progenitor of opportunistic pathogens belonging to the M. avium complex and is endowed with features that place it between saprophytic and pathogenic species. Herein, we have sequenced the complete MIP genome to understand its unique lifestyle, the basis of immunomodulation and habitat diversification in mycobacteria. As a case of massive gene acquisitions, 50.5% of MIP open reading frames (ORFs) are laterally acquired. We show, for the first time for Mycobacterium, that the MIP genome has a mosaic architecture. These gene acquisitions have led to the enrichment of selected gene families critical to MIP physiology. Comparative genomic analysis indicates a higher antigenic potential of MIP, giving it a unique ability for immunomodulation. It also suggests an important role of genomic fluidity in habitat diversification within mycobacteria and provides a unique view of evolutionary divergence and putative bottlenecks that might have eventually led to intracellular survival and pathogenic attributes in mycobacteria.
Affiliation(s)
- Vikram Saini
- Department of Biochemistry, University of Delhi South Campus, New Delhi 110021, India
|
41
|
Elhai J, Liu H, Taton A. Detection of horizontal transfer of individual genes by anomalous oligomer frequencies. BMC Genomics 2012; 13:245. [PMID: 22702893 PMCID: PMC3497702 DOI: 10.1186/1471-2164-13-245] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/29/2011] [Accepted: 05/18/2012] [Indexed: 11/10/2022]
Abstract
Background Understanding the history of life requires that we understand the transfer of genetic material across phylogenetic boundaries. Detecting genes that were acquired by means other than vertical descent is a basic step in that process. Detection by discordant phylogenies is computationally expensive and not always definitive. Many have used easily computed compositional features as an alternative procedure. However, different compositional methods produce different predictions, and the effectiveness of any method is not well established. Results The ability of octamer frequency comparisons to detect genes artificially seeded in cyanobacterial genomes was markedly increased by using as a training set those genes that are highly conserved over all bacteria. Using a subset of octamer frequencies in such tests also increased effectiveness, but this depended on the specific target genome and the source of the contaminating genes. The presence of high frequency octamers and the GC content of the contaminating genes were important considerations. A method comprising best practices from these tests was devised, the Core Gene Similarity (CGS) method, and it performed better than simple octamer frequency analysis, codon bias, or GC contrasts in detecting seeded genes or naturally occurring transposons. From a comparison of predictions with phylogenetic trees, it appears that the effectiveness of the method is confined to horizontal transfer events that have occurred recently in evolutionary time. Conclusions The CGS method may be an improvement over existing surrogate methods to detect genes of foreign origin.
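The idea of scoring a gene by how typical its oligomer frequencies are, relative to a trusted training set, can be illustrated with a minimal Python sketch. This is not the CGS method from the abstract: the function names, the pseudocount, the mean log-probability score and the toy dinucleotide example are all assumptions chosen for brevity (the abstract works with octamers and a training set of highly conserved genes).

```python
from collections import Counter
from math import log


def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_profile(reference_seqs, k, pseudocount=1.0):
    """Build a k-mer probability function from a set of reference sequences.

    The pseudocount (an assumption of this sketch) keeps unseen k-mers
    from receiving probability zero.
    """
    counts = Counter()
    for seq in reference_seqs:
        counts.update(kmers(seq, k))
    total = sum(counts.values()) + pseudocount * 4 ** k

    def prob(kmer):
        # Counter returns 0 for k-mers that never occurred in the reference.
        return (counts[kmer] + pseudocount) / total

    return prob


def typicality(seq, prob, k):
    """Mean log-probability of the k-mers of seq; lower means more atypical."""
    words = kmers(seq, k)
    if not words:
        raise ValueError("sequence shorter than k")
    return sum(log(prob(w)) for w in words) / len(words)


# Toy data: an AT-rich "genome" and two candidate genes.
reference = ["ATATATATATAT", "TATATATATA"]
prob = build_profile(reference, k=2)
native_score = typicality("ATATATAT", prob, k=2)
foreign_score = typicality("GCGCGCGC", prob, k=2)
```

In this toy setting the GC-rich candidate scores lower than the AT-rich one, which is the kind of contrast a compositional method relies on. Real use would need a calibrated threshold, which this sketch does not provide.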
Affiliation(s)
- Jeff Elhai
- Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23284, USA.
|
42
|
Zhai Z, Reinert G, Song K, Waterman MS, Luan Y, Sun F. Normal and compound Poisson approximations for pattern occurrences in NGS reads. J Comput Biol 2012; 19:839-54. [PMID: 22697250 PMCID: PMC3375642 DOI: 10.1089/cmb.2012.0029] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
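For orientation only, the simplest relative of the approximations described above is a plain Poisson tail for the count of a single word in one long sequence. The Python sketch below assumes an i.i.d. background and a word that does not overlap itself, and it ignores read sampling entirely, so it is considerably cruder than the compound Poisson and normal approximations of the paper. All function names are hypothetical.

```python
from math import exp


def expected_word_count(word, seq_len, base_probs):
    """Expected occurrences of word in an i.i.d. sequence of length seq_len."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    # Number of start positions times the probability of a match at one position.
    return (seq_len - len(word) + 1) * p


def poisson_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam), by summing the lower terms."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = expected_word_count("ACGT", 1027, uniform)
p_value = poisson_tail(10, lam)  # chance of seeing 10 or more occurrences
```

A small tail probability would flag the word as over-represented under this background model. Self-overlapping words such as AAAA occur in clumps, which is one reason a compound Poisson model is preferred for them.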
Affiliation(s)
- Zhiyuan Zhai
- School of Mathematics, Shandong University, Jinan, Shandong, China
- Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Kai Song
- School of Mathematics, Peking University, Beijing, China
- Michael S. Waterman
- Molecular and Computational Biology, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, China
- Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong, China
- Fengzhu Sun
- Molecular and Computational Biology, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, China
|
43
|
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis. Algorithms Mol Biol 2012; 7:10. [PMID: 22551152 PMCID: PMC3402988 DOI: 10.1186/1748-7188-7-10] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 01/25/2012] [Accepted: 05/02/2012] [Indexed: 01/06/2023]
Abstract
Background Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
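The suffix property stated in the abstract can be checked with a minimal Python sketch of the classical CGR iteration for DNA. The corner assignment, the starting point at the centre of the unit square and the use of the per-coordinate (Chebyshev) distance are assumptions of this sketch, not details confirmed by the abstract.

```python
# Corner of the unit square assigned to each nucleotide (one common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading seq from left to right.

    Each step moves halfway from the current point towards the corner
    of the nucleotide being read.
    """
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        x = (x + cx) / 2.0
        y = (y + cy) / 2.0
    return x, y


def chebyshev(p, q):
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


suffix = "GATTACA"  # shared suffix, L = 7
a = cgr_point("ACGTTGCA" + suffix)
b = cgr_point("TTTT" + suffix)
gap = chebyshev(a, b)  # bounded by 2 ** -len(suffix)
```

Every shared step halves the gap between the two trajectories, so after L shared symbols the points differ by at most 2^-L in each coordinate, whatever the preceding prefixes were.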
|
44
|
Ménigaud S, Mallet L, Picord G, Churlaud C, Borrel A, Deschavanne P. GOHTAM: a website for 'Genomic Origin of Horizontal Transfers, Alignment and Metagenomics'. Bioinformatics 2012; 28:1270-1. [PMID: 22426345 PMCID: PMC3338014 DOI: 10.1093/bioinformatics/bts118] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 01/20/2023]
Abstract
Motivation: This website allows the detection of horizontal transfers based on a combination of parametric methods and proposes an origin by searching for neighbors in a bank of genomic signatures. This bank is also used to search for the origin of DNA fragments from metagenomics studies. Results: Different services are provided, such as inferring a phylogenetic tree from sequence signatures or comparing two genomes and displaying the rearrangements that happened since their separation. Availability and implementation: http://gohtam.rpbs.univ-paris-diderot.fr/ Contact: patrick.deschavanne@univ-paris-diderot.fr; ludovic.mallet@jouy.inra.fr Supplementary information: Supplementary data are available at Bioinformatics online http://gohtam.rpbs.univ-paris-diderot.fr:8080/Data/bin/GOHTAM_bin.tgz
Affiliation(s)
- Sabine Ménigaud
- Molécules Thérapeutiques in silico, Institut National de la Santé et de la Recherche Médicale (INSERM) UMR-S 973, Université Paris Diderot, Sorbonne Paris Cité, 35 rue Héléne Brion, Paris, France
|
45
|
Frenkel S, Kirzhner V, Korol A. Organizational heterogeneity of vertebrate genomes. PLoS One 2012; 7:e32076. [PMID: 22384143 PMCID: PMC3288070 DOI: 10.1371/journal.pone.0032076] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/15/2011] [Accepted: 01/23/2012] [Indexed: 01/06/2023]
Abstract
Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as "texts" using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter--GDM) allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences.
Affiliation(s)
- Abraham Korol
- Department of Evolutionary and Environmental Biology and Institute of Evolution, University of Haifa, Mount Carmel, Haifa, Israel
|
46
|
Soares SC, Abreu VAC, Ramos RTJ, Cerdeira L, Silva A, Baumbach J, Trost E, Tauch A, Hirata R, Mattos-Guaraldi AL, Miyoshi A, Azevedo V. PIPS: pathogenicity island prediction software. PLoS One 2012; 7:e30848. [PMID: 22355329 PMCID: PMC3280268 DOI: 10.1371/journal.pone.0030848] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Received: 07/12/2011] [Accepted: 12/22/2011] [Indexed: 01/08/2023]
Abstract
The adaptability of pathogenic bacteria to hosts is influenced by the genomic plasticity of the bacteria, which can be increased by such mechanisms as horizontal gene transfer. Pathogenicity islands play a major role in this type of gene transfer because they are large, horizontally acquired regions that harbor clusters of virulence genes that mediate the adhesion, colonization, invasion, immune system evasion, and toxigenic properties of the acceptor organism. Currently, pathogenicity islands are mainly identified in silico based on various characteristic features: (1) deviations in codon usage, G+C content or dinucleotide frequency and (2) insertion sequences and/or tRNA genetic flanking regions together with transposase coding genes. Several computational techniques for identifying pathogenicity islands exist. However, most of these techniques are only directed at the detection of horizontally transferred genes and/or the absence of certain genomic regions of the pathogenic bacterium in closely related non-pathogenic species. Here, we present a novel software suite designed for the prediction of pathogenicity islands (pathogenicity island prediction software, or PIPS). In contrast to other existing tools, our approach is capable of utilizing multiple features for pathogenicity island detection in an integrative manner. We show that PIPS provides better accuracy than other available software packages. As an example, we used PIPS to study the veterinary pathogen Corynebacterium pseudotuberculosis, in which we identified seven putative pathogenicity islands.
Affiliation(s)
- Siomar C. Soares
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Vinícius A. C. Abreu
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Louise Cerdeira
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
- Artur Silva
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
- Jan Baumbach
- Department of Computer Science, Max-Planck-Institut für Informatik, Saarbrücken, Saarland, Germany
- Eva Trost
- Center for Biotechnology, Bielefeld University, Bielefeld, Nordrhein-Westfalen, Germany
- Andreas Tauch
- Center for Biotechnology, Bielefeld University, Bielefeld, Nordrhein-Westfalen, Germany
- Raphael Hirata
- Microbiology and Immunology Discipline, Medical Sciences Faculty, State University of Rio de Janeiro, Rio de Janeiro, Brazil
- Ana L. Mattos-Guaraldi
- Microbiology and Immunology Discipline, Medical Sciences Faculty, State University of Rio de Janeiro, Rio de Janeiro, Brazil
- Anderson Miyoshi
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Vasco Azevedo
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
47
|
Abstract
Methods for identifying alien genes in genomes fall into two general classes. Phylogenetic methods examine the distribution of a gene's homologues among genomes to find those with relationships not consistent with vertical inheritance. These approaches include identifying orphan genes which lack homologues in closely related genomes and genes with unduly high levels of similarity to genes in otherwise unrelated genomes. Rigorous statistical tests are available to place confidence intervals for predicted alien genes. Parametric methods examine the compositional properties of genes within a genome to find those with atypical properties, likely indicating the directional mutational pressures of a donor genome. These methods may compare the properties of genes to genomic averages, properties of genes to each other, or properties of large, multigene regions of the chromosome. Here, we discuss the strengths and weaknesses of each approach.
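The simplest parametric comparison mentioned above, gene properties against the genomic average, can be sketched in Python using G+C content. The z-score rule, the cutoff of 2.0 and the function names are assumptions made for illustration; the abstract does not prescribe a particular statistic or threshold.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a non-empty nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Return names of genes whose GC content deviates strongly from the mean.

    genes maps a gene name to its nucleotide sequence. The mean and standard
    deviation are taken over all genes, so a gene set dominated by foreign
    DNA would distort the baseline.
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:
        return []  # all genes have identical composition; nothing stands out
    return [name for name, value in gc.items() if abs(value - mu) / sd > z_cutoff]


# Toy genome: nine genes with 50% GC and one GC-rich candidate.
genome = {f"gene{i}": "ATGC" * 5 for i in range(9)}
genome["alien"] = "GGCC" * 5
flagged = atypical_genes(genome)
```

Such a test can only detect genes whose composition has not yet drifted towards that of the recipient genome, which is one of the weaknesses of parametric methods that reviews of this kind discuss.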
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA
|
48
|
Anderson CNK, Liu L, Pearl D, Edwards SV. Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods Mol Biol 2012; 856:3-28. [PMID: 22399453 DOI: 10.1007/978-1-61779-585-5_1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
Phylogenies based on different genes can conflict with one another; methods that resolve such ambiguities are becoming more popular, and offer a number of advantages for phylogenetic analysis. We review so-called species tree methods and the biological forces that can undermine them by violating important aspects of the underlying models. Such forces include horizontal gene transfer, gene duplication, and natural selection. We review ways of detecting loci influenced by such forces and offer suggestions for identifying or accommodating them. The way forward involves identifying outlier loci, as is done in population genetic analysis of neutral and selected loci, and removing them from further analysis, or developing more complex species tree models that can accommodate such loci.
Affiliation(s)
- Christian N K Anderson
- Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
|
49
|
Bezuidt O, Pierneef R, Mncube K, Lima-Mendez G, Reva ON. Mainstreams of horizontal gene exchange in enterobacteria: consideration of the outbreak of enterohemorrhagic E. coli O104:H4 in Germany in 2011. PLoS One 2011; 6:e25702. [PMID: 22022434 PMCID: PMC3195076 DOI: 10.1371/journal.pone.0025702] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 07/14/2011] [Accepted: 09/08/2011] [Indexed: 11/18/2022]
Abstract
BACKGROUND Escherichia coli O104:H4 caused a severe outbreak in Europe in 2011. The strain TY-2482 sequenced from this outbreak allowed the discovery of its closest relatives but failed to resolve ways in which it originated and evolved. Given this, can we expect similar outbreaks to recur or arise spontaneously in the future? The inability to answer these questions shows limitations of the current comparative and evolutionary genomics methods. PRINCIPAL FINDINGS The study revealed oscillations of gene exchange in enterobacteria, which originated from marine γ-Proteobacteria. These mobile genetic elements have become recombination hotspots and effective 'vehicles' ensuring a wide distribution of successful combinations of fitness and virulence genes among enterobacteria. Two remarkable peculiarities of the strain TY-2482 and its relatives were observed: i) these strains retained their genetic primitiveness, as they somehow avoided the main fluxes of horizontal gene transfer which effectively penetrated other enterobacteria; ii) acquisition of antibiotic resistance genes in a plasmid genomic island of β-Proteobacteria origin which ontologically is unrelated to the predominant genomic islands of enterobacteria. CONCLUSIONS Oscillations of horizontal gene exchange activity were reported which result from a counterbalance between the acquired resistance of bacteria towards existing mobile vectors and the generation of new vectors in the environmental microflora. We hypothesized that TY-2482 may originate from a genetically primitive lineage of E. coli that has evolved in confined geographical areas and was brought by human migration or cattle trade onto an intersection of several independent streams of horizontal gene exchange. Development of a system for monitoring the new and most active gene exchange events was proposed.
Affiliation(s)
- Oliver Bezuidt
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, South Africa
- Rian Pierneef
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, South Africa
- Kingdom Mncube
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, South Africa
- Gipsi Lima-Mendez
- Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles, Bruxelles, Belgium
- Oleg N. Reva
- Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, South Africa
|
50
|
Bioinformatic analysis reveals high diversity of bacterial genes for laccase-like enzymes. PLoS One 2011; 6:e25724. [PMID: 22022440 PMCID: PMC3192119 DOI: 10.1371/journal.pone.0025724] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.3] [Received: 06/07/2011] [Accepted: 09/09/2011] [Indexed: 11/19/2022]
Abstract
Fungal laccases have been used in various fields ranging from processes in wood and paper industries to environmental applications. Although a few bacterial laccases have been characterized in recent years, prokaryotes have largely been neglected as a source of novel enzymes, in part due to the lack of knowledge about the diversity and distribution of laccases within Bacteria. In this work genes for laccase-like enzymes were searched for in over 2,200 complete and draft bacterial genomes and four metagenomic datasets, using the custom profile Hidden Markov Models for two- and three-domain laccases. More than 1,200 putative genes for laccase-like enzymes were retrieved from chromosomes and plasmids of diverse bacteria. In 76% of the genes, signal peptides were predicted, indicating that these bacterial laccases may be exported from the cytoplasm, which contrasts with the current belief. Moreover, several examples of putatively horizontally transferred bacterial laccase genes were described. Many metagenomic sequences encoding fragments of laccase-like enzymes could not be phylogenetically assigned, indicating considerable novelty. Laccase-like genes were also found in anaerobic bacteria, autotrophs and alkaliphiles, thus opening new hypotheses regarding their ecological functions. Bacteria identified as carrying laccase genes represent potential sources for future biotechnological applications.
|