1
|
Ozeri E, Zehavi M, Ziv-Ukelson M. New algorithms for structure informed genome rearrangement. Algorithms Mol Biol 2023; 18:17. [PMID: 38037088 PMCID: PMC10691145 DOI: 10.1186/s13015-023-00239-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 08/17/2023] [Indexed: 12/02/2023] Open
Abstract
We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.
Collapse
Affiliation(s)
- Eden Ozeri
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel.
| | - Meirav Zehavi
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel
| | - Michal Ziv-Ukelson
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel
| |
Collapse
|
2
|
Zimerman GR, Svetlitsky D, Zehavi M, Ziv-Ukelson M. Approximate search for known gene clusters in new genomes using PQ-trees. Algorithms Mol Biol 2021; 16:16. [PMID: 34243815 PMCID: PMC8272295 DOI: 10.1186/s13015-021-00190-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Accepted: 06/05/2021] [Indexed: 11/13/2022] Open
Abstract
Gene clusters are groups of genes that are co-locally conserved across various genomes, not necessarily in the same order. Their discovery and analysis is valuable in tasks such as gene annotation and prediction of gene interactions, and in the study of genome organization and evolution. The discovery of conserved gene clusters in a given set of genomes is a well studied problem, but with the rapid sequencing of prokaryotic genomes a new problem is inspired. Namely, given an already known gene cluster that was discovered and studied in one genomic dataset, to identify all the instances of the gene cluster in a given new genomic sequence. Thus, we define a new problem in comparative genomics, denoted PQ-Tree Search that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer arguments \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$d_T$$\end{document}dT and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$d_S$$\end{document}dS, and a new sequence of genes S. The objective is to identify in S approximate new instances of the gene cluster; These instances could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$d_T$$\end{document}dT and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$d_S$$\end{document}dS, respectively. We prove that PQ-Tree Search is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-Tree Search in \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$O^*(2^{\gamma })$$\end{document}O∗(2γ) time, where \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\gamma$$\end{document}γ is the maximum degree of a node in T and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$O^*$$\end{document}O∗ is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-trees. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.
Collapse
|
3
|
Benshahar A, Chalifa-Caspi V, Hermelin D, Ziv-Ukelson M. A Biclique Approach to Reference-Anchored Gene Blocks and Its Applications to Genomic Islands. J Comput Biol 2018; 25:214-235. [DOI: 10.1089/cmb.2017.0108] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Affiliation(s)
- Arnon Benshahar
- Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel
| | - Vered Chalifa-Caspi
- National Institute for Biotechnology in the Negev, Ben-Gurion University, Beer-Sheva, Israel
| | - Danny Hermelin
- Department of Industrial Engineering and Management, Ben-Gurion University, Beer-Sheva, Israel
| | - Michal Ziv-Ukelson
- Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel
| |
Collapse
|
4
|
Winter S, Jahn K, Wehner S, Kuchenbecker L, Marz M, Stoye J, Böcker S. Finding approximate gene clusters with Gecko 3. Nucleic Acids Res 2016; 44:9600-9610. [PMID: 27679480 PMCID: PMC5175365 DOI: 10.1093/nar/gkw843] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2015] [Revised: 09/06/2016] [Accepted: 09/12/2016] [Indexed: 12/15/2022] Open
Abstract
Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min.
Collapse
Affiliation(s)
- Sascha Winter
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
| | - Katharina Jahn
- Genome Informatics, Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Computational Biology Group, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Stefanie Wehner
- RNA Bioinformatics and High Throughput Analysis, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
- Institute of Aquaculture, School of Natural Sciences, University of Stirling, Stirling, FK9LA, Scotland, UK
| | - Leon Kuchenbecker
- Genome Informatics, Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, Germany
| | - Manja Marz
- RNA Bioinformatics and High Throughput Analysis, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
- Leibniz Institute for Age Research-Fritz Lipmann Institute (FLI), Jena, Germany
| | - Jens Stoye
- Genome Informatics, Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Sebastian Böcker
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
| |
Collapse
|
5
|
Heydari M, Marashi SA, Tusserkani R, Sadeghi M. Reconstruction of phylogenetic trees of prokaryotes using maximal common intervals. Biosystems 2014; 124:86-94. [PMID: 25195150 DOI: 10.1016/j.biosystems.2014.09.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2012] [Revised: 08/13/2014] [Accepted: 09/01/2014] [Indexed: 11/15/2022]
Abstract
One of the fundamental problems in bioinformatics is phylogenetic tree reconstruction, which can be used for classifying living organisms into different taxonomic clades. The classical approach to this problem is based on a marker such as 16S ribosomal RNA. Since evolutionary events like genomic rearrangements are not included in reconstructions of phylogenetic trees based on single genes, much effort has been made to find other characteristics for phylogenetic reconstruction in recent years. With the increasing availability of completely sequenced genomes, gene order can be considered as a new solution for this problem. In the present work, we applied maximal common intervals (MCIs) in two or more genomes to infer their distance and to reconstruct their evolutionary relationship. Additionally, measures based on uncommon segments (UCS's), i.e., those genomic segments which are not detected as part of any of the MCIs, are also used for phylogenetic tree reconstruction. We applied these two types of measures for reconstructing the phylogenetic tree of 63 prokaryotes with known COG (clusters of orthologous groups) families. Similarity between the MCI-based (resp. UCS-based) reconstructed phylogenetic trees and the phylogenetic tree obtained from NCBI taxonomy browser is as high as 93.1% (resp. 94.9%). We show that in the case of this diverse dataset of prokaryotes, tree reconstruction based on MCI and UCS outperforms most of the currently available methods based on gene orders, including breakpoint distance and DCJ. We additionally tested our new measures on a dataset of 13 closely-related bacteria from the genus Prochlorococcus. In this case, distances like rearrangement distance, breakpoint distance and DCJ proved to be useful, while our new measures are still appropriate for phylogenetic reconstruction.
Collapse
Affiliation(s)
- Mahdi Heydari
- Department of Algorithms and Computation, College of Engineering, University of Tehran, Tehran, Iran
| | - Sayed-Amir Marashi
- Department of Biotechnology, College of Science, University of Tehran, Tehran, Iran; School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
| | - Ruzbeh Tusserkani
- School of Mathematics, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
| | - Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| |
Collapse
|
6
|
Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF. Orthology detection combining clustering and synteny for very large datasets. PLoS One 2014; 9:e105015. [PMID: 25137074 PMCID: PMC4138177 DOI: 10.1371/journal.pone.0105015] [Citation(s) in RCA: 73] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2014] [Accepted: 07/14/2014] [Indexed: 11/18/2022] Open
Abstract
The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.
Collapse
Affiliation(s)
- Marcus Lechner
- Institut für Pharmazeutische Chemie, Philipps-Universität Marburg, Marburg, Germany
- * E-mail:
| | - Maribel Hernandez-Rosales
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade de Brasília, Brasília, Brasil
| | - Daniel Doerr
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Nicolas Wieseke
- Faculty of Mathematics and Computer Science University of Leipzig, Leipzig, Germany
| | - Annelyse Thévenin
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Roland K. Hartmann
- Institut für Pharmazeutische Chemie, Philipps-Universität Marburg, Marburg, Germany
| | - Sonja J. Prohaska
- Computational EvoDevo Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- The Santa Fe Institute, Santa Fe, New Mexico, United States of America
- RNomics Group, Fraunhofer Institut for Cell Therapy and Immunology, Leipzig, Germany
| |
Collapse
|
7
|
Lucas JMEX, Muffato M, Roest Crollius H. PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees. BMC Bioinformatics 2014; 15:268. [PMID: 25103980 PMCID: PMC4155083 DOI: 10.1186/1471-2105-15-268] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2014] [Accepted: 07/17/2014] [Indexed: 11/21/2022] Open
Abstract
Background Extant genomes share regions where genes have the same order and orientation, which are thought to arise from the conservation of an ancestral order of genes during evolution. Such regions of so-called conserved synteny, or synteny blocks, must be precisely identified and quantified, as a prerequisite to better understand the evolutionary history of genomes. Results Here we describe PhylDiag, a software that identifies statistically significant synteny blocks in pairwise comparisons of eukaryote genomes. Compared to previous methods, PhylDiag uses gene trees to define gene homologies, thus allowing gene deletions to be considered as events that may break the synteny. PhylDiag also accounts for gene orientations, blocks of tandem duplicates and lineage specific de novo gene births. Starting from two genomes and the corresponding gene trees, PhylDiag returns synteny blocks with gaps less than or equal to the maximum gap parameter gapmax. This parameter is theoretically estimated, and together with a utility to graphically display results, contributes to making PhylDiag a user friendly method. In addition, putative synteny blocks are subject to a statistical validation to verify that they are unlikely to be due to a random combination of genes. Conclusions We benchmark several known metrics to measure 2D-distances in a matrix of homologies and we compare PhylDiag to i-ADHoRe 3.0 on real and simulated data. We show that PhylDiag correctly identifies small synteny blocks even with insertions, deletions, incorrect annotations or micro-inversions. Finally, PhylDiag allowed us to identify the most relevant distance metric for 2D-distance calculation between homologies. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-268) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
| | | | - Hugues Roest Crollius
- Ecole Normale Supérieure, Institut de Biologie de l'ENS, IBENS, 46 rue d'Ulm, 75005 Paris, France.
| |
Collapse
|
8
|
Jahn K. Efficient computation of approximate gene clusters based on reference occurrences. J Comput Biol 2012; 18:1255-74. [PMID: 21899430 DOI: 10.1089/cmb.2011.0132] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Whole genome comparison based on the analysis of gene cluster conservation has become a popular approach in comparative genomics. While gene order and gene content as a whole randomize over time, it is observed that certain groups of genes which are often functionally related remain co-located across species. However, the conservation is usually not perfect which turns the identification of these structures, often referred to as approximate gene clusters, into a challenging task. In this article, we present an efficient set distance based approach that computes approximate gene clusters by means of reference occurrences. We show that it yields highly comparable results to the corresponding non-reference based approach, while its polynomial runtime allows for approximate gene cluster detection in parameter ranges that used to be feasible only with simpler, e.g., max-gap based, gene cluster models. To illustrate further the performance and predictive power of our algorithm, we compare it to a state-of-the art approach for max-gap gene cluster computation.
Collapse
Affiliation(s)
- Katharina Jahn
- AG Genominformatik, Technische Fakultät, Universität Bielefeld, Bielefeld, Germany.
| |
Collapse
|
9
|
Abstract
The purpose of this chapter is to provide a comprehensive review of the field of genome rearrangement, i.e., comparative genomics, based on the representation of genomes as ordered sequences of signed genes. We specifically focus on the "hard part" of genome rearrangement, how to handle duplicated genes. The main questions are: how have present-day genomes evolved from a common ancestor? What are the most realistic evolutionary scenarios explaining the observed gene orders? What was the content and structure of ancestral genomes? We aim to provide a concise but complete overview of the field, starting with the practical problem of finding an appropriate representation of a genome as a sequence of ordered genes or blocks, namely the problems of orthology, paralogy, and synteny block identification. We then consider three levels of gene organization: the gene family level (evolution by duplication, loss, and speciation), the cluster level (evolution by tandem duplications), and the genome level (all types of rearrangement events, including whole genome duplication).
Collapse
Affiliation(s)
- Nadia El-Mabrouk
- Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, QC, Canada
| | | |
Collapse
|
10
|
Grusea S. On the distribution of the number of cycles in the breakpoint graph of a random signed permutation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1411-1416. [PMID: 21116045 DOI: 10.1109/tcbb.2010.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
We use the finite Markov chain embedding technique to obtain the distribution of the number of cycles in the breakpoint graph of a random uniform signed permutation. This further gives a very good approximation of the distribution of the reversal distance between two random genomes.
Collapse
Affiliation(s)
- Simona Grusea
- LATP - UMR CNRS 6632, Equipe EBM, Université de Provence, France.
| |
Collapse
|
11
|
Grusea S, Pardoux E, Chabrol O, Pontarotti P. Compound Poisson Approximation and Testing for Gene Clusters with Multigene Families. J Comput Biol 2011; 18:579-94. [DOI: 10.1089/cmb.2010.0043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022] Open
Affiliation(s)
- S. Grusea
- LATP–UMR CNRS 6632, Équipe Évolution Biologique et Modélisation, Université de Provence, Marseille, France
- Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse, Toulouse, France
| | - E. Pardoux
- LATP–UMR CNRS 6632, Équipe Évolution Biologique et Modélisation, Université de Provence, Marseille, France
| | - O. Chabrol
- LATP–UMR CNRS 6632, Équipe Évolution Biologique et Modélisation, Université de Provence, Marseille, France
| | - P. Pontarotti
- LATP–UMR CNRS 6632, Équipe Évolution Biologique et Modélisation, Université de Provence, Marseille, France
| |
Collapse
|
12
|
Affiliation(s)
- Guillaume Blin
- Université Paris-Est, LIGM–UMR CNRS 8049, Marne-la-vallíe cedex 2, France
| | - David Faye
- Université Paris-Est, LIGM–UMR CNRS 8049, Marne-la-vallíe cedex 2, France
- Université Gaston-Berger, LANI, Saint-Louis, Senegal
| | - Jens Stoye
- Technische Fakultät, Universität Bielefeld, Bielefeld, Germany
| |
Collapse
|
13
|
Yang Q, Yi G, Zhang F, Thon MR, Sze SH. Identifying gene clusters within localized regions in multiple genomes. J Comput Biol 2010; 17:657-68. [PMID: 20500020 DOI: 10.1089/cmb.2009.0116] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
An important strategy to study genome evolution is to investigate the clustering of orthologous genes among multiple genomes, in which the most popular approaches require that the distance between adjacent genes in a cluster be small. We investigate a different formulation based on constraining the overall size of a cluster and develop statistical significance estimates that allow direct comparison of clusters of different sizes. We first consider a restricted version which requires that orthologous genes are strictly ordered within each cluster and show that it can be solved in polynomial time. We then develop practical exact algorithms for the unrestricted problem that allows paralogous genes within a genome and clusters that may not appear in every genome while considering a general model in which a gene is allowed to appear in more than one orthologous group. We show that our algorithm can identify biologically relevant gene clusters on four bacterial genomes Bacillus subtilis, Streptococcus pyogenes, Streptococcus pneumoniae, and Clostridium acetobutylicum. We also show that our algorithm can identify significantly more functionally enriched gene clusters on four yeast genomes Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus than previous algorithms. A software program (GCFinder) and a list of gene clusters found on the bacterial and the yeast genomes are available at http://faculty.cse.tamu.edu/shsze/gcfinder .
Collapse
Affiliation(s)
- Qingwu Yang
- Department of Computer Science and Engineering, Texas A&M University, College Station, Texas 77843-3112, USA
| | | | | | | | | |
Collapse
|
14
|
|
15
|
Ling X, He X, Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. ACTA ACUST UNITED AC 2009; 25:571-7. [PMID: 19158161 DOI: 10.1093/bioinformatics/btp027] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Spatial clusters of genes conserved across multiple genomes provide important clues to gene functions and evolution of genome organization. Existing methods of identifying these clusters often made restrictive assumptions, such as exact conservation of gene order, and relied on heuristic algorithms. RESULTS We developed a very efficient algorithm based on a 'gene teams' model that allows genes in the clusters to appear in different orders. This allows us to detect conserved gene clusters under flexible evolutionary constraints in a large number of genomes. Our statistical evaluation incorporates the evolutionary relationship among genomes, a key aspect that has been missing in most previous studies. We conducted a large-scale analysis of 133 bacterial genomes. Our results confirm that our approach is an effective way of uncovering functionally related genes. The comparison with known operons and the analysis of the structural properties of our predicted clusters suggest that operons are an important source of constraint, but there are also other forces that determine evolution of gene order and arrangement. Using our method, we predicted functions of many poorly characterized genes in bacterial. The combined algorithmic and statistical methods we present here provide a rigorous framework for systematically studying evolutionary constraints of genomic contexts. AVAILABILITY The software, data and the full results of this article are available online at http://www.ews.uiuc.edu/~xuling/mcmusec.
Collapse
Affiliation(s)
- Xu Ling
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| | | | | |
Collapse
|
16
|
Schmidt T, Stoye J. Gecko and GhostFam: rigorous and efficient gene cluster detection in prokaryotic genomes. Methods Mol Biol 2008; 396:165-82. [PMID: 18025693 DOI: 10.1007/978-1-59745-515-2_12] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2023]
Abstract
A popular approach in comparative genomics is to locate groups or clusters of orthologous genes in multiple genomes and to postulate functional association between the genes contained in such clusters. For a rigorous and efficient detection in multiple genomes, it is essential to have an appropriate model of gene clusters accompanied by efficient algorithms locating them. The Gecko method described herein was designed to serve as a basic tool for the detection and visualization of gene cluster data in prokaryotic genomes founded on a formal string-based gene cluster model.
Collapse
|
17
|
Abstract
We present here a method to identify microsyntenies across several genomes. This method adopts the innovative approach of deconstructing proteins into their domains. This allows the detection of strings of domains that are conserved in their content, but not necessarily in their order, that we refer to as domain teams or syntenies of domains. The prominent feature of the method is that it relaxes the rigidity of the orthology criterion and avoids many of the pitfalls of gene families identification methods, often hampered by multidomain proteins or low levels of sequence similarity. This approach, that allows both inter- and intrachromosomal comparisons, proves to be more sensitive than the classical methods based on pairwise sequence comparisons, particularly in the simultaneous treatment of many species. The automated and fast detection of domain teams is implemented in the DomainTeam software. In this chapter, we describe the procedure to run DomainTeam. After formatting the input and setting up the parameters, running the algorithm produces an output file comprising all the syntenies of domains shared by two or more (sometimes all) of the compared genomes.
Collapse
|
18
|
Durand D, Hoberman R. Diagnosing duplications – can it be done? Trends Genet 2006; 22:156-64. [PMID: 16442663 DOI: 10.1016/j.tig.2006.01.002] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2005] [Revised: 11/30/2005] [Accepted: 01/11/2006] [Indexed: 01/10/2023]
Abstract
New genes arise through duplication and modification of DNA sequences on a range of scales: single gene duplication, duplication of large chromosomal fragments and whole-genome duplication. Each duplication mechanism has specific characteristics that influence the fate of the resulting duplicates, such as the size of the duplicated fragment, the potential for dosage imbalance, the preservation or disruption of regulatory control and genomic context. The ability to diagnose or identify the mechanism that produced a pair of paralogs has the potential to increase our ability to reconstruct evolutionary history, to understand the processes that govern genome evolution and to make functional predictions based on paralogy. The recent availability of large amounts of whole-genome sequence, often from several closely related species, has stimulated a wealth of new computational methods to diagnose gene duplications.
Collapse
Affiliation(s)
- Dannie Durand
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
| | | |
Collapse
|
19
|
|
20
|
Rahmann S, Klau GW. Integer Linear Programs for Discovering Approximate Gene Clusters. ACTA ACUST UNITED AC 2006. [DOI: 10.1007/11851561_28] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/20/2023]
|
21
|
Hoberman R, Sankoff D, Durand D. The statistical analysis of spatially clustered genes under the maximum gap criterion. J Comput Biol 2005; 12:1083-102. [PMID: 16241899 DOI: 10.1089/cmb.2005.12.1083] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Statistical validation of gene clusters is imperative for many important applications in comparative genomics which depend on the identification of genomic regions that are historically and/or functionally related. We develop the first rigorous statistical treatment of max-gap clusters, a cluster definition frequently used in empirical studies. We present exact expressions for the probability of observing an individual cluster of a set of marked genes in one genome, as well as upper and lower bounds on the probability of observing a cluster of h homologs in a pairwise whole-genome comparison. We demonstrate the utility of our approach by applying it to a whole-genome comparison of E. coli and B. subtilis. Code for statistical tests is available at.
Collapse
Affiliation(s)
- Rose Hoberman
- Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
| | | | | |
Collapse
|
22
|
Pasek S, Bergeron A, Risler JL, Louis A, Ollivier E, Raffinot M. Identification of genomic features using microsyntenies of domains: domain teams. Genome Res 2005; 15:867-74. [PMID: 15899966 PMCID: PMC1142477 DOI: 10.1101/gr.3638405] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
The detection, across several genomes, of local conservation of gene content and proximity considerably helps the prediction of features of interest, such as gene fusions or physical and functional interactions. Here, we want to process realistic models of chromosomes, in which genes (or genomic segments of several genes) can be duplicated within a chromosome, or be absent from some other chromosome(s). Our approach adopts the technique of temporarily forgetting genes and working directly with protein "domains" such as those found in Pfam. This allows the detection of strings of domains that are conserved in their content, but not necessarily in their order, which we refer to as domain teams. The prominent feature of the method is that it relaxes the rigidity of the orthology criterion and avoids many of the pitfalls of gene-families identification methods, often hampered by multidomain proteins or low levels of sequence similarity. This approach, that allows both inter- and intrachromosomal comparisons, proves to be more sensitive than the classical methods based on pairwise sequence comparisons, particularly in the simultaneous treatment of many species. The automated and fast detection of domain teams, together with its increased sensitivity at identifying segments of identical (protein-coding) gene contents as well as gene fusions, should prove a useful complement to other existing methods.
Collapse
Affiliation(s)
- Sophie Pasek
- Laboratoire Génome et Informatique, CNRS/UEVE, 91034 Evry cedex, France.
| | | | | | | | | | | |
Collapse
|
23
|
Hoberman R, Durand D. The Incompatible Desiderata of Gene Cluster Properties. COMPARATIVE GENOMICS 2005. [DOI: 10.1007/11554714_7] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
|
24
|
|
25
|
Abstract
Comparisons of the genome sequences of related species suggests varying patterns of chromosomal rearrangements in different evolutionary lineages. In this review, I focus on the quantitative characterization of rearrangement processes and discuss specific inventories that have been compiled to date. Of particular interest are the statistical distribution of the lengths of inverted or locally transposed chromosome fragments (notably very short ones), inhomogeneities in susceptibility to evolutionary breakpoints in chromosomal regions, the relative importance of genome doubling in the history of multicellular eukaryotes, and of lateral transfer versus gene gain and loss in prokaryotes. These developments provide challenges to computational biologists to refine, revise and scale up mathematical models and algorithms for analyzing genome rearrangements.
Collapse
Affiliation(s)
- David Sankoff
- Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, K1N 6N5, Canada.
| |
Collapse
|