1
|
Braga MDV, Doerr D, Rubert DP, Stoye J. Family-Free Genome Comparison. Methods Mol Biol 2024; 2802:57-72. [PMID: 38819556 DOI: 10.1007/978-1-0716-3838-5_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
The comparison of large-scale genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC that, relying on built-in interfaces with a sequence comparison tool (either BLAST+ or DIAMOND) and with an ILP solver (either CPLEX or Gurobi), provides several methods for analyses that do not require prior classification of genes across the studied genomes. Taking annotated genome sequences as input, FFGC is a complete workflow for genome comparison allowing not only the computation of measures of similarity and dissimilarity but also the inference of gene families, simultaneously based on sequence similarities and large-scale genomic features.
Collapse
Affiliation(s)
- Marilia D V Braga
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Daniel Doerr
- Department for Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research, and Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Diego P Rubert
- Faculdade de Computacão, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS, Brazil
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
2
|
Rubert DP, Doerr D, Braga MDV. The potential of family-free rearrangements towards gene orthology inference. J Bioinform Comput Biol 2021; 19:2140014. [PMID: 34775922 DOI: 10.1142/s021972002140014x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Recently, we proposed an efficient ILP formulation [Rubert DP, Martinez FV, Braga MDV, Natural family-free genomic distance, Algorithms Mol Biol 16:4, 2021] for exactly computing the rearrangement distance of two genomes in a family-free setting. In such a setting, neither prior classification of genes into families, nor further restrictions on the genomes are imposed. Given two genomes, the mentioned ILP computes an optimal matching of the genes taking into account simultaneously local mutations, given by gene similarities, and large-scale genome rearrangements. Here, we explore the potential of using this ILP for inferring groups of orthologs across several species. More precisely, given a set of genomes, our method first computes all pairwise optimal gene matchings, which are then integrated into gene families in the second step. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities. It can be downloaded from gitlab.ub.uni-bielefeld.de/gi/FFGC. We obtained promising results with experiments on both simulated and real data.
Collapse
Affiliation(s)
- Diego P Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Daniel Doerr
- Faculty of Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Marília D V Braga
- Faculty of Technology and CeBiTec, Bielefeld University, Bielefeld, Germany
| |
Collapse
|
3
|
Rubert DP, Martinez FV, Braga MDV. Natural family-free genomic distance. Algorithms Mol Biol 2021; 16:4. [PMID: 33971908 PMCID: PMC8111734 DOI: 10.1186/s13015-021-00183-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Accepted: 04/13/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410-431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. RESULTS In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
Collapse
Affiliation(s)
- Diego P. Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Fábio V. Martinez
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Marília D. V. Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|
4
|
Abstract
This chapter covers the theory and practice of ortholog gene set computation. In the theoretical part we give detailed and formal descriptions of the relevant concepts. We also cover the topic of graph-based clustering as a tool to compute ortholog gene sets. In the second part we provide an overview of practical considerations intended for researchers who need to determine orthologous genes from a collection of annotated genomes, briefly describing some of the most popular programs and resources currently available for this task.
Collapse
|
5
|
Abstract
BACKGROUND The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity. RESULTS We propose an exact ILP algorithm to solve the family-free DCJ similarity problem, then we show its APX-hardness and present four combinatorial heuristics with computational experiments comparing their results to the ILP. CONCLUSIONS We show that the family-free DCJ similarity can be computed in reasonable time, although for larger genomes it is necessary to resort to heuristics. This provides a basis for further studies on the applicability and model refinement of family-free whole genome similarity measures.
Collapse
Affiliation(s)
- Diego P. Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS Brazil
| | - Edna A. Hoshino
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS Brazil
| | - Marília D. V. Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Fábio V. Martinez
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS Brazil
| |
Collapse
|
6
|
Abstract
The comparison of genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC which provides several methods for gene order analyses that do not require prior knowledge of evolutionary relationships between the genes across the studied genomes. Moreover, the tool features a complete workflow for genome comparison, requiring nothing but annotated genome sequences as input.
Collapse
Affiliation(s)
- Daniel Doerr
- School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
| | - Pedro Feijão
- Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology, Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
7
|
Doerr D, Kowada LAB, Araujo E, Deshpande S, Dantas S, Moret BME, Stoye J. New Genome Similarity Measures based on Conserved Gene Adjacencies. J Comput Biol 2017; 24:616-634. [PMID: 28590847 DOI: 10.1089/cmb.2017.0065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful-but also most complex-models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases, the computational costs to the general family-free case are the same, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that have different expression powers. We give polynomial-time algorithms for two of them, while we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, proving the applicability and performance of our newly defined similarity measures.
Collapse
Affiliation(s)
- Daniel Doerr
- 1 École Polytechnique Fédérale de Lausanne , Lausanne, Switzerland
| | | | - Eloi Araujo
- 3 Universidade Federal de Mato Grosso do Sul , Campo Grande, Brazil .,4 Faculty of Technology and Center for Biotechnology, Bielefeld University , Bielefeld, Germany
| | - Shachi Deshpande
- 1 École Polytechnique Fédérale de Lausanne , Lausanne, Switzerland .,5 Department of Computer Science and Engineering, IIT Bombay , Mumbai, India
| | | | | | - Jens Stoye
- 2 Universidade Federal Fluminense , Niterói, Brazil .,4 Faculty of Technology and Center for Biotechnology, Bielefeld University , Bielefeld, Germany
| |
Collapse
|
8
|
Doerr D, Balaban M, Feijão P, Chauve C. The gene family-free median of three. Algorithms Mol Biol 2017; 12:14. [PMID: 28559921 PMCID: PMC5446766 DOI: 10.1186/s13015-017-0106-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2017] [Accepted: 05/18/2017] [Indexed: 11/20/2022] Open
Abstract
Background The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. Methods We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$k \le 3$$\end{document}k≤3 and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufficiently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. Results The developed methods compute accurate positional orthologs for genomes comparable in size of bacterial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj-AM performs equally or better when compared to the well-established gene family prediction tool MultiMSOAR. Conclusions We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confidence positional orthologs. Electronic supplementary material The online version of this article (doi:10.1186/s13015-017-0106-z) contains supplementary material, which is available to authorized users.
Collapse
|
9
|
Genome-wide identification and characterization of Fox genes in the silkworm, Bombyx mori. Funct Integr Genomics 2015; 15:511-22. [DOI: 10.1007/s10142-015-0440-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2014] [Revised: 03/25/2015] [Accepted: 04/06/2015] [Indexed: 12/28/2022]
|
10
|
Martinez FV, Feijão P, Braga MDV, Stoye J. On the family-free DCJ distance and similarity. Algorithms Mol Biol 2015; 10:13. [PMID: 25859276 PMCID: PMC4391664 DOI: 10.1186/s13015-015-0041-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2015] [Accepted: 03/13/2015] [Indexed: 11/10/2022] Open
Abstract
Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearragement operations that transform one given genome into another one. In a family-based setting, genes are grouped into gene families and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program to its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
Collapse
|
11
|
Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF. Orthology detection combining clustering and synteny for very large datasets. PLoS One 2014; 9:e105015. [PMID: 25137074 PMCID: PMC4138177 DOI: 10.1371/journal.pone.0105015] [Citation(s) in RCA: 73] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2014] [Accepted: 07/14/2014] [Indexed: 11/18/2022] Open
Abstract
The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.
Collapse
Affiliation(s)
- Marcus Lechner
- Institut für Pharmazeutische Chemie, Philipps-Universität Marburg, Marburg, Germany
- * E-mail:
| | - Maribel Hernandez-Rosales
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade de Brasília, Brasília, Brasil
| | - Daniel Doerr
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Nicolas Wieseke
- Faculty of Mathematics and Computer Science University of Leipzig, Leipzig, Germany
| | - Annelyse Thévenin
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Roland K. Hartmann
- Institut für Pharmazeutische Chemie, Philipps-Universität Marburg, Marburg, Germany
| | - Sonja J. Prohaska
- Computational EvoDevo Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- The Santa Fe Institute, Santa Fe, New Mexico, United States of America
- RNomics Group, Fraunhofer Institut for Cell Therapy and Immunology, Leipzig, Germany
| |
Collapse
|
12
|
Chauve C, El-Mabrouk N, Guéguen L, Semeria M, Tannier E. Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
|