1
|
Alexandrino AO, Oliveira AR, Jean G, Fertin G, Dias U, Dias Z. Reversal and Transposition Distance on Unbalanced Genomes Using Intergenic Information. J Comput Biol 2023; 30:861-876. [PMID: 37222724 DOI: 10.1089/cmb.2023.0087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/25/2023] Open
Abstract
The most common way to calculate the rearrangement distance between two genomes is to use the size of a minimum length sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, based on the assumption that genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models by either considering genomes with different gene content (unbalanced genomes) or including more genomic characteristics to the mathematical representation of the genomes, such as the distribution of intergenic regions sizes. In this study, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes, because indels are included in the rearrangement model (i.e., the set of possible rearrangements allowed when we compute the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving a previous 4.5 approximation. This algorithm is extended so as to deal with gene orientation and to maintain the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Collapse
Affiliation(s)
| | | | - Géraldine Jean
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Guillaume Fertin
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
2
|
Alexandrino AO, Brito KL, Oliveira AR, Dias U, Dias Z. Reversal and Indel Distance With Intergenic Region Information. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1628-1640. [PMID: 36260590 DOI: 10.1109/tcbb.2022.3215615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Recent works on genome rearrangements have shown that incorporating intergenic region information along with gene order in models provides better estimations for the rearrangement distance than using gene order alone. The reversal distance is one of the main problems in genome rearrangements. It has a polynomial time algorithm when only gene order is used to model genomes, assuming that repeated genes do not exist and that gene orientation is known, even when the genomes have distinct gene sets. The reversal distance is NP-hard and has a 2-approximation algorithm when incorporating intergenic regions. However, the problem has only been studied assuming genomes with the same set of genes. In this work, we consider the variation that incorporates intergenic regions and that allows genomes to have distinct sets of genes, a scenario that leads us to include indels operations (insertions and deletions). We present a 2.5-approximation algorithm using the labeled intergenic breakpoint graph, which is based on the well-known breakpoint graph structure. We also present an experimental analysis of the proposed algorithm using simulated data, which showed that the practical approximation factor is considerably less than 2.5. Furthermore, we used the algorithm in real genomes to construct a phylogenetic tree.
Collapse
|
3
|
Brito KL, Alexandrino AO, Oliveira AR, Dias U, Dias Z. Genome Rearrangement Distance With a Flexible Intergenic Regions Aspect. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1641-1653. [PMID: 35385387 DOI: 10.1109/tcbb.2022.3165443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Most mathematical models for genome rearrangement problems have considered only gene order. In this way, the rearrangement distance considering some set of events, such as reversal and transposition events, is commonly defined as the minimum number of rearrangement events that transform the gene order from a genome G1 into the gene order from a genome G2. Recent works initiate incorporating more information such as the sizes of the intergenic regions (i.e., number of nucleotides between pairs of consecutive genes), which yields good results for estimated distances on real data. In these models, besides transforming the gene order, the sequence of rearrangement events must transform the list of intergenic regions sizes from G1 into the list of intergenic regions sizes from G2 (target list). We study a new variation where the target list is flexible, in the sense that each target intergenic region size is in a range of acceptable values. This allows us to model scenarios where the main objective is still to transform the order of genes from the source genome into the target genome, allowing flexibility in the sizes of the intergenic regions, since the nucleotides in these regions tend to undergo more changes when compared to genes. We investigate the rearrangement distance considering three sets of events, two with the exclusive use of reversals or transpositions, and the other allowing both rearrangement events. We present approximation algorithms for the problems and an NP-hardness proof. Our results rely on the Flexible Weighted Cycle Graph, adapted from the breakpoint graph to deal with flexible intergenic regions sizes.
Collapse
|
4
|
Brito KL, Oliveira AR, Alexandrino AO, Dias U, Dias Z. Rearrangement Distance with Reversals, Indels and Moves in Intergenic Regions on Signed and Unsigned Permutations. J Bioinform Comput Biol 2023; 21:2350009. [PMID: 37104034 DOI: 10.1142/s0219720023500099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/19/2023]
Abstract
Genome rearrangement events are widely used to estimate a minimum-size sequence of mutations capable of transforming a genome into another. The length of this sequence is called distance, and determining it is the main goal in genome rearrangement distance problems. Problems in the genome rearrangement field differ regarding the set of rearrangement events allowed and the genome representation. In this work, we consider the scenario where the genomes share the same set of genes, gene orientation is known or unknown, and intergenic regions (structures between a pair of genes and at the extremities of the genome) are taken into account. We use two models, the first model allows only conservative events (reversals and moves), and the second model includes non-conservative events (insertions and deletions) in the intergenic regions. We show that both models result in NP-hard problems no matter if gene orientation is known or unknown. When the information regarding the orientation of genes is available, we present for both models an approximation algorithm with a factor of 2. For the scenario where this information is unavailable, we propose a 4-approximation algorithm for both models.
Collapse
Affiliation(s)
- Klairton Lima Brito
- Institute of Computing, University of Campinas, 1251 Albert Einstein Avenue, Campinas, São Paulo, Brazil
| | - Andre Rodrigues Oliveira
- Institute of Computing, University of Campinas, 1251 Albert Einstein Avenue, Campinas, São Paulo, Brazil
- Computing and Informatics Department, Mackenzie Presbyterian University, 905 Mackenzie Avenue, Barueri, São Paulo, Brazil
| | | | - Ulisses Dias
- School of Technology, University of Campinas, 1888 Paschoal Marmo Street, Limeira, São Paulo, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, 1251 Albert Einstein Avenue, Campinas, São Paulo, Brazil
| |
Collapse
|
5
|
Alexandrino AO, Oliveira AR, Dias U, Dias Z. Incorporating intergenic regions into reversal and transposition distances with indels. J Bioinform Comput Biol 2021; 19:2140011. [PMID: 34775923 DOI: 10.1142/s0219720021400114] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Problems in the genome rearrangement field are often formulated in terms of pairwise genome comparison: given two genomes [Formula: see text] and [Formula: see text], find the minimum number of genome rearrangements that may have occurred during the evolutionary process. This broad definition lacks at least two important considerations: the first being which features are extracted from genomes to create a useful mathematical model, and the second being which types of genome rearrangement events should be represented. Regarding the first consideration, seminal works in the genome rearrangement field solely used gene order to represent genomes as permutations of integer numbers, neglecting many important aspects like gene duplication, intergenic regions, and complex interactions between genes. Regarding the second consideration, some rearrangement events are widely studied such as reversals and transpositions. In this paper, we shed light on the first consideration and created a model that takes into account gene order and the number of nucleotides in intergenic regions. In addition, we consider events of reversals, transpositions, and indels (insertions and deletions) of genomic material. We present a 4-approximation algorithm for reversals and indels, a [Formula: see text]-approximation algorithm for transpositions and indels, and a 6-approximation for reversals, transpositions, and indels.
Collapse
Affiliation(s)
| | - Andre Rodrigues Oliveira
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, São Paulo, Brazil
| | - Ulisses Dias
- School of Technology, University of Campinas, 1888 Paschoal Marmo St., 13484-332 Limeira, São Paulo, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, São Paulo, Brazil
| |
Collapse
|
6
|
Alexandrino AO, Oliveira AR, Dias U, Dias Z. Labeled Cycle Graph for Transposition and Indel Distance. J Comput Biol 2021; 29:243-256. [PMID: 34724796 DOI: 10.1089/cmb.2021.0279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In the comparative genomics field, one way to infer the evolutionary distance between two organisms of related species is by finding the minimum number of large-scale mutations, called genome rearrangements, that transform one genome into the other. This number is referred to as the rearrangement distance. Since problems in this area emerged in the mid-1990s, several genome rearrangements have been proposed. Rearrangements that do not alter the genome content are called conservative, and in this group we have the following: the reversal, which inverts a segment of the genome; the transposition, which exchanges two consecutive segments; and the double cut and join, which cuts two different pairs of adjacent blocks and joins them differently. Seminal works compared genomes sharing the same set of conserved blocks, but nowadays, researchers started looking at genomes with unequal gene content, by allowing the use of nonconservative rearrangements such as insertion and deletion (jointly called indel). The transposition distance and the transposition and indel distance are both NP-hard. We investigate the transposition and indel distance and present a structure called labeled cycle graph, representing an instance of rearrangement distance problems for genomes with unequal gene content. This structure is used to devise a lower bound and a 2-approximation algorithm for the transposition and indel distance.
Collapse
Affiliation(s)
| | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
7
|
Siqueira G, Alexandrino AO, Oliveira AR, Dias Z. Approximation algorithm for rearrangement distances considering repeated genes and intergenic regions. Algorithms Mol Biol 2021; 16:21. [PMID: 34645469 PMCID: PMC8513232 DOI: 10.1186/s13015-021-00200-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2021] [Accepted: 08/31/2021] [Indexed: 01/02/2023] Open
Abstract
The rearrangement distance is a method to compare genomes of different species. Such distance is the number of rearrangement events necessary to transform one genome into another. Two commonly studied events are the transposition, which exchanges two consecutive blocks of the genome, and the reversal, which reverts a block of the genome. When dealing with such problems, seminal works represented genomes as sequences of genes without repetition. More realistic models started to consider gene repetition or the presence of intergenic regions, sequences of nucleotides between genes and in the extremities of the genome. This work explores the transposition and reversal events applied in a genome representation considering both gene repetition and intergenic regions. We define two problems called Minimum Common Intergenic String Partition and Reverse Minimum Common Intergenic String Partition. Using a relation with these two problems, we show a \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\Theta \left( k \right)$$\end{document}Θk-approximation for the Intergenic Transposition Distance, the Intergenic Reversal Distance, and the Intergenic Reversal and Transposition Distance problems, where k is the maximum number of copies of a gene in the genomes. Our practical experiments on simulated genomes show that the use of partitions improves the estimates for the distances.
Collapse
|
8
|
Brito KL, Alexandrino AO, Oliveira AR, Dias U, Dias Z. Reversals and transpositions distance with proportion restriction. J Bioinform Comput Biol 2021; 19:2150013. [PMID: 34162319 DOI: 10.1142/s021972002150013x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In the field of comparative genomics, one way of comparing two genomes is through the analysis of how they distinguish themselves based on a set of mutations called rearrangement events. When considering that genomes undergo different types of rearrangements, it can be assumed that some events are more common than others. To model this assumption, one can assign different weights to different events, where common events tend to cost less than others. However, this approach, called weighted, does not guarantee that the rearrangement assumed to be the most frequent will be also the most frequently returned by proposed algorithms. To overcome this issue, we investigate a new problem where we seek the shortest sequence of rearrangement events able to transform one genome into the other, with a restriction regarding the proportion between the events returned. Here, we consider two rearrangement events: reversal, that inverts the order and the orientation of the genes inside a segment of the genome, and transposition, that moves a segment of the genome to another position. We show the complexity of this problem for any desired proportion considering scenarios where the orientation of the genes is known or unknown. We also develop an approximation algorithm with a constant approximation factor for each scenario and, in particular, we describe an improved (asymptotic) approximation algorithm for the case where the gene orientation is known. At last, we present the experimental tests comparing the proposed algorithms with others from the literature for the version of the problem without the proportion restriction.
Collapse
Affiliation(s)
- Klairton Lima Brito
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| | | | - Andre Rodrigues Oliveira
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| | - Ulisses Dias
- School of Technology, University of Campinas, 1888 Paschoal Marmo St., 13484-332 Limeira, So Paulo, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| |
Collapse
|
9
|
Abstract
The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
Collapse
Affiliation(s)
| | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
10
|
Abstract
One of the main problems in Computational Biology is to find the evolutionary distance among species. In most approaches, such distance only involves rearrangements, which are mutations that alter large pieces of the species' genome. When we represent genomes as permutations, the problem of transforming one genome into another is equivalent to the problem of Sorting Permutations by Rearrangement Operations. The traditional approach is to consider that any rearrangement has the same probability to happen, and so, the goal is to find a minimum sequence of operations which sorts the permutation. However, studies have shown that some rearrangements are more likely to happen than others, and so a weighted approach is more realistic. In a weighted approach, the goal is to find a sequence which sorts the permutations, such that the cost of that sequence is minimum. This work introduces a new type of cost function, which is related to the amount of fragmentation caused by a rearrangement. We present some results about the lower and upper bounds for the fragmentation-weighted problems and the relation between the unweighted and the fragmentation-weighted approach. Our main results are 2-approximation algorithms for five versions of this problem involving reversals and transpositions. We also give bounds for the diameters concerning these problems and provide an improved approximation factor for simple permutations considering transpositions.
Collapse
Affiliation(s)
| | - Carla Negri Lintzmayer
- Center for Mathematics, Computation and Cognition, Federal University of ABC (UFABC), Santo André, São Paulo 09210-580, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas (Unicamp), Campinas, São Paulo 13083-852, Brazil
| |
Collapse
|