1. RecGraph: recombination-aware alignment of sequences to variation graphs. Bioinformatics 2024; 40:btae292. PMID: 38676570. DOI: 10.1093/bioinformatics/btae292.
Abstract
MOTIVATION Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, a fact that is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of a sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS In this paper, we extend the notion of sequence-to-graph alignment to a variation graph so that recombinations are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most one recombination; we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new path is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most one recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
2. Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. J Comput Biol 2023; 30:469-491. PMID: 36730750. DOI: 10.1089/cmb.2022.0424.
Abstract
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
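As background on how an alignment-free, fixed-length embedding can be derived from raw reads, here is a minimal k-mer frequency sketch; the actual Reads2Vec construction is more elaborate, and the function name and parameters below are illustrative assumptions, not the tool's API:

```python
from itertools import product

def kmer_embedding(reads, k=3, alphabet="ACGT"):
    """Map a set of raw reads to one fixed-length vector of k-mer frequencies.
    Illustrative sketch only: no assembly or alignment is required."""
    # Fixed coordinate order: every possible k-mer over the alphabet.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for read in reads:
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in index:  # skip k-mers containing ambiguous bases such as 'N'
                counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

vec = kmer_embedding(["ACGTACGT", "TTTACG"], k=3)
# len(vec) == 4**3 == 64, independent of the number or length of the reads
```

Because the vector length depends only on k and the alphabet, samples with different read counts become directly comparable inputs for standard classifiers and clustering algorithms.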
3.
Abstract
BACKGROUND Being able to efficiently call variants from the increasing amount of sequencing data daily produced from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. RESULTS We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as SARS-CoV-2. MALVIRUS allows users to: (1) construct a variant catalog consisting of a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. CONCLUSIONS Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often has better recall.
4. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. PMID: 36969737. PMCID: PMC10038355. DOI: 10.1007/s11047-022-09882-6.
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
5.
Abstract
In recent years, there has been an increasing amount of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique, which allows for capturing more detailed information regarding the specific genetic mutations on each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of the information across the tools implementing the above steps. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data, incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have introduced the capability of producing some plots to quickly visualize results.
6. Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics 2021; 37:464-472. PMID: 32926128. PMCID: PMC8088329. DOI: 10.1093/bioinformatics/btaa779.
Abstract
Motivation Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results We introduce a novel computational problem, called gene assignment, and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. Availability and implementation The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. Supplementary information Supplementary data are available at Bioinformatics online.
7.
Abstract
Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is its elevated false negative and missing value rates, which result in a large space of possible solutions and in turn make inference difficult, and sometimes infeasible, with current approaches and tools. One possible solution is to reduce the size of an SCS instance, usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells, and to infer the tree from this reduced-size instance. In this work, we present celluloid, a new clustering procedure aimed at clustering such categorical vector or matrix data, here representing SCS instances. We show that celluloid clusters mutations with high precision, in the sense that it almost never pairs mutations that are unrelated in the ground truth, and that it obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. We demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and raising considerably the upper bound on the size of SCS instances that can be solved in practice. Our approach, celluloid (clustering single cell sequencing data around centroids), is available at https://github.com/AlgoLab/celluloid/ under an MIT license, as well as on the Python Package Index (PyPI) at https://pypi.org/project/celluloid-clust/.
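The general idea of clustering categorical vectors around centroids can be sketched with a k-modes-style procedure. This is illustrative only: celluloid's actual initialization and dissimilarity choices differ, and the `init` parameter below is an assumption added for reproducibility, not part of the tool:

```python
import random

def kmodes(rows, k, iters=20, seed=0, init=None):
    """Cluster categorical vectors around centroid 'modes' under Hamming
    dissimilarity (number of mismatching entries). Illustrative sketch."""
    rng = random.Random(seed)
    centroids = [list(r) for r in (init if init is not None else rng.sample(rows, k))]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its closest centroid.
        for i, row in enumerate(rows):
            assign[i] = min(range(k),
                            key=lambda c: sum(a != b for a, b in zip(row, centroids[c])))
        # Update step: each centroid becomes the column-wise mode of its cluster.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == c]
            if members:
                centroids[c] = [max(set(col), key=col.count)
                                for col in (list(t) for t in zip(*members))]
    return assign, centroids
```

For an SCS-like 0/1 mutation matrix, rows that agree on most entries end up in the same cluster, and the cluster modes play the role of the reduced-size instance handed to the downstream phylogeny inference.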
8. Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics 2021; 37:178-184. PMID: 32730595. PMCID: PMC8055217. DOI: 10.1093/bioinformatics/btaa676.
Abstract
MOTIVATION The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. RESULTS To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to correctly classify similar and dissimilar trees, both on simulated and on real data. AVAILABILITY AND IMPLEMENTATION An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
9. Inferring cancer progression from Single-Cell Sequencing while allowing mutation losses. Bioinformatics 2021; 37:326-333. PMID: 32805010. PMCID: PMC8058767. DOI: 10.1093/bioinformatics/btaa722.
Abstract
Motivation In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. Results We present Simulated Annealing Single-Cell inference (SASC): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS datasets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by also allowing a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real datasets and in comparison with some other available methods. Availability and implementation The SASC tool is open source and available at https://github.com/sciccolella/sasc. Supplementary information Supplementary data are available at Bioinformatics online.
10. gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data. BMC Bioinformatics 2020; 21:413. PMID: 33297943. PMCID: PMC7725124. DOI: 10.1186/s12859-020-03736-7.
Abstract
BACKGROUND Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. RESULTS We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. CONCLUSIONS gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.
11. A Note on Computable Embeddings for Ordinals and Their Reverses. Lecture Notes in Computer Science 2020. PMCID: PMC7309500. DOI: 10.1007/978-3-030-51466-2_1.
12.
Abstract
If $\mathcal{T}$ is a topology of open sets on a set X, a real-valued function on X is of Baire class one over $\mathcal{T}$, if it is the pointwise limit of a sequence of functions in the corresponding ring of continuous functions C(X). If F is a Bishop topology of functions on X, a constructive and function-theoretic alternative to $\mathcal{T}$ introduced by Bishop, we define a real-valued function on X to be of Baire class one over F, if it is the pointwise limit of a sequence of functions in F. We show that the set $B_1(F)$ of functions of Baire class one over a given Bishop topology F on a set X is a Bishop topology on X. Consequently, notions and results from the general theory of Bishop spaces are naturally translated to the study of Baire class one functions. We work within Bishop's informal system of constructive mathematics $\mathrm{BISH}^*$, that is, $\mathrm{BISH}$ extended with inductive definitions with rules of countably many premises.
13.
Abstract
Specifying a computational problem requires fixing encodings for input and output: encoding graphs as adjacency matrices, characters as integers, integers as bit strings, and vice versa. For such discrete data, the actual encoding is usually straightforward and/or complexity-theoretically inessential (up to polynomial time, say); but concerning continuous data, already real numbers naturally suggest various encodings with very different computational properties. With respect to qualitative computability, Kreitz and Weihrauch (1985) had identified admissibility as the crucial property for "reasonable" encodings over the Cantor space of infinite binary sequences, so-called representations. Precisely for these does the Kreitz-Weihrauch representation (aka Main) Theorem apply, characterizing continuity of functions in terms of continuous realizers. We similarly identify refined criteria for representations suitable for quantitative complexity investigations. Higher-type complexity is captured by replacing Cantor space as ground space with more general compact metric spaces, similar to equilogical spaces in computability.
14.
Abstract
Suppose we are given a collection of mathematical objects such as the class of connected compact Polish groups or the set of all real numbers which are normal to some base.
15.
Abstract
We study the computational complexity of uniformly converting the base-a expansion of an irrational number to the base-b expansion. In particular, we are interested in subsets of the irrationals where such conversion can be performed with little overhead. We show that such conversion is possible, essentially with polynomial overhead, for the set of irrationals that are not Liouville numbers. Furthermore, it is known that there are irrational numbers x such that the expansion of x in one integer base is efficiently computable, but the expansion of x in certain other integer bases is not. We prove that any such number must be a Liouville number.
16.
Abstract
Amalgamation SNP (ASNP) is a fragment of existential second-order logic that strictly contains binary connected MMSNP of Feder and Vardi and binary connected guarded monotone SNP of Bienvenu, ten Cate, Lutz, and Wolter; it is a promising candidate for an expressive subclass of NP that exhibits a complexity dichotomy. We show that ASNP has a complexity dichotomy if and only if the infinite-domain dichotomy conjecture holds for constraint satisfaction problems for first-order reducts of binary finitely bounded homogeneous structures. For such CSPs, powerful universal-algebraic hardness conditions are known that are conjectured to describe the border between NP-hard and polynomial-time tractable CSPs. The connection to CSPs also implies that every ASNP sentence can be evaluated in polynomial time on classes of finite structures of bounded treewidth. We show that the syntax of ASNP is decidable. The proof relies on the fact that for classes of finite binary structures given by finitely many forbidden substructures, the amalgamation property is decidable.
17. Faster Online Computation of the Succinct Longest Previous Factor Array. Lecture Notes in Computer Science 2020. PMCID: PMC7309487. DOI: 10.1007/978-3-030-51466-2_31.
Abstract
We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each $1 \le i \le n$, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position $j < i$. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in $O(n \log^2 n)$ time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in $O(n \log^2 n)$ time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm computes directly the succinct LPF and LCS arrays.
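As a definitional aid, the LPF array can be computed directly from its definition by brute force. This sketch is only to pin down what is being computed; the algorithm described above is online and far more efficient than this quadratic baseline:

```python
def lpf_naive(t):
    """Brute-force LPF (0-based): lpf[i] is the length of the longest factor
    with one occurrence ending at position i and another ending at some j < i."""
    n = len(t)
    lpf = [0] * n
    for i in range(n):
        length = 0
        # Lengths can be grown one by one: if a factor of length L ending at i
        # reoccurs earlier, so does its suffix of length L - 1.
        while length <= i and t[i - length:i + 1] in t[:i]:
            length += 1
        lpf[i] = length
    return lpf

# lpf_naive("abaababa") -> [0, 0, 1, 1, 2, 3, 2, 3]
```

The membership test `in t[:i]` restricts the earlier occurrence to end strictly before position i, matching the condition $j < i$ in the definition.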
18. Deterministic and Nondeterministic Iterated Uniform Finite-State Transducers: Computational and Descriptional Power. Lecture Notes in Computer Science 2020. PMCID: PMC7309513. DOI: 10.1007/978-3-030-51466-2_8.
19. Balancing Straight-Line Programs for Strings and Trees. Lecture Notes in Computer Science 2020. PMCID: PMC7309493. DOI: 10.1007/978-3-030-51466-2_26.
Abstract
The talk will explain a recent balancing result according to which a context-free grammar in Chomsky normal form of size m that produces a single string w of length n (such a grammar is also called a straight-line program) can be transformed in linear time into a context-free grammar in Chomsky normal form for w of size $\mathcal{O}(m)$, whose unique derivation tree has depth $\mathcal{O}(\log n)$. This solves an open problem in the area of grammar-based compression, improves many results in this area, and greatly simplifies many existing constructions. Similar balancing results can be formulated for various grammar-based tree compression formalisms such as top DAGs and forest straight-line programs. The talk is based on joint work with Moses Ganardi and Artur Jeż. An extended abstract appeared in [11]; a long version of the paper can be found in [12].
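The straight-line programs discussed above can be made concrete with a toy encoding (an illustrative assumption, not the paper's formalism): an SLP is a Chomsky-normal-form grammar whose rules map each nonterminal to either a terminal symbol or an ordered pair of nonterminals, and expanding the start symbol yields the single string the grammar derives.

```python
def expand(slp, start):
    """Expand a straight-line program to the single string it produces.
    A rule is either a terminal character (str) or a pair of nonterminals."""
    rhs = slp[start]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return expand(slp, left) + expand(slp, right)

# A grammar of size 4 producing a string of length 8: doubling at every level
# shows how n can be exponential in the grammar size m.
slp = {"A": "a", "X1": ("A", "A"), "X2": ("X1", "X1"), "X3": ("X2", "X2")}
```

Since the derived string can be exponentially longer than the grammar, a derivation tree of depth $\mathcal{O}(\log n)$ as in the balancing result is the best one can ask for asymptotically.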
20. Recent Advances in Text-to-Pattern Distance Algorithms. Lecture Notes in Computer Science 2020. PMCID: PMC7309496. DOI: 10.1007/978-3-030-51466-2_32.
Abstract
Computing text-to-pattern distances is a fundamental problem in pattern matching. Given a text of length n and a pattern of length m, we are asked to output the distance between the pattern and every m-substring of the text. A basic variant of this problem is the computation of Hamming distances, that is, counting the number of mismatches (different characters aligned) for each alignment. Other popular variants include the $\ell_1$ distance (Manhattan distance), the $\ell_2$ distance (Euclidean distance) and the general $\ell_p$ distance. While each of those problems trivially generalizes classical pattern matching, efficient algorithms for them require a broader set of tools, usually involving both algebraic and combinatorial insights. We briefly survey the history of the problems, and then focus on the progress made in the past few years in many specific settings: fine-grained complexity and lower bounds, $(1+\varepsilon)$ multiplicative approximations, k-bounded relaxations, streaming algorithms, purely combinatorial algorithms, and other recently proposed variants.
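The basic Hamming variant has a straightforward naive baseline, sketched here for orientation only; the surveyed results improve substantially on this $O(nm)$ bound:

```python
def hamming_distances(text, pattern):
    """Hamming distance between the pattern and every length-m substring
    of the text: the naive O(nm) baseline."""
    m = len(pattern)
    return [sum(a != b for a, b in zip(text[i:i + m], pattern))
            for i in range(len(text) - m + 1)]

# hamming_distances("aabba", "ab") -> [1, 0, 1, 2]
```

Each output position counts the aligned character pairs that differ, so an entry of 0 marks an exact occurrence of the pattern, recovering classical pattern matching as the special case.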
21. Non-coding Enumeration Operators. Lecture Notes in Computer Science 2020. PMCID: PMC7309498. DOI: 10.1007/978-3-030-51466-2_10.
22. Does Relaxing the Infinite Sites Assumption Give Better Tumor Phylogenies? An ILP-Based Comparative Approach. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1410-1423. PMID: 31603766. DOI: 10.1109/tcbb.2018.2865729.
Abstract
Most of the evolutionary history reconstruction approaches are based on the infinite sites assumption, which states that mutations appear once in the evolutionary history. The Perfect Phylogeny model is the result of the infinite sites assumption and has been widely used to infer cancer evolution. Nonetheless, recent results show that recurrent and back mutations are present in the evolutionary history of tumors, hence the Perfect Phylogeny model might be too restrictive. We propose an approach that allows the loss of previously acquired mutations and multiple acquisitions of a character. Moreover, we provide an ILP formulation for the evolutionary tree reconstruction problem. Our formulation allows us to tackle both the Incomplete Directed Phylogeny problem and the Clonal Reconstruction problem when general evolutionary models are considered. The latter problem is fundamental in cancer genomics, where the goal is to study the evolutionary history of a tumor considering as input data the fraction of cells having a certain mutation in a set of cancer samples. For the Clonal Reconstruction problem, an experimental analysis shows the advantage of allowing mutation losses. Namely, by analyzing real and simulated datasets, our ILP approach provides a better interpretation of the evolutionary history than a Perfect Phylogeny. The software is at https://github.com/AlgoLab/gppf.
Collapse
|
23
|
Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array. J Comput Biol 2019; 26:948-961. [PMID: 31140836 DOI: 10.1089/cmb.2018.0230] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Indexing huge collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multistring generalizations of the Burrows-Wheeler transform (BWT) and the longest common prefix (LCP) array, since efficient solutions to both problems are essential ingredients of several algorithms on collections of strings, such as those for genome assembly. In this article, we explore a multithread computational strategy for building the BWT and the LCP array. Our algorithm applies a divide-and-conquer approach that leads to the parallel computation of the multistring BWT and the LCP array.
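As an illustration of what these generalizations compute, here is a minimal, quadratic-time sketch of a multistring BWT and LCP array built by explicit suffix sorting. This is not the paper's parallel algorithm; the function name and the sentinel convention are hypothetical:

```python
def multistring_bwt_lcp(strings):
    """Naive multistring BWT and LCP construction, for illustration only.

    Each string is terminated by a sentinel '$' (assumed smaller than any
    symbol); ties among sentinel-ending suffixes are broken by string index."""
    suffixes = []
    for sid, s in enumerate(strings):
        t = s + "$"
        for i in range(len(t)):
            # key: the suffix itself, then string id to break sentinel ties
            suffixes.append((t[i:], sid, i))
    suffixes.sort()
    bwt, lcp = [], []
    prev = None
    for suf, sid, i in suffixes:
        t = strings[sid] + "$"
        # BWT stores the character preceding each sorted suffix
        bwt.append(t[i - 1] if i > 0 else "$")
        if prev is None:
            lcp.append(0)
        else:
            # LCP of two adjacent sorted suffixes, computed directly
            k = 0
            while k < min(len(prev), len(suf)) and prev[k] == suf[k]:
                k += 1
            lcp.append(k)
        prev = suf
    return "".join(bwt), lcp
```

Real constructions avoid materializing suffixes entirely; this sketch only makes the two output arrays concrete.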
Collapse
|
24
|
ASGAL: aligning RNA-Seq data to a splicing graph to detect novel alternative splicing events. BMC Bioinformatics 2018; 19:444. [PMID: 30458725 PMCID: PMC6247705 DOI: 10.1186/s12859-018-2436-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2018] [Accepted: 10/15/2018] [Indexed: 11/14/2022] Open
Abstract
Background While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph. Results We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events involving either annotated or unannotated splice sites. ASGAL takes as input the annotated transcripts of a gene and an RNA-Seq sample, and computes (1) the spliced alignments of each input read, and (2) a list of novel events with respect to the gene annotation. Conclusions An experimental analysis shows that ASGAL can enrich the annotation with novel alternative splicing events even when the genes in an experiment express at most one isoform. Compared with other tools that use spliced alignments of reads against a reference genome for differential analysis, ASGAL better predicts events that use splice sites which are novel with respect to a splicing graph, showing higher accuracy. To the best of our knowledge, ASGAL is the first tool that detects novel alternative splicing events by directly aligning reads to a splicing graph. Availability Source code, documentation, and data are available for download at http://asgal.algolab.eu.
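The splicing graph the abstract refers to can be sketched as a simple directed graph over exons. The following toy construction (exon identifiers instead of genomic coordinates; the function name is hypothetical) illustrates the idea, not ASGAL's implementation:

```python
def build_splicing_graph(transcripts):
    """Toy splicing graph: vertices are exons, and an edge links two exons
    that appear consecutively in at least one annotated transcript.
    Transcripts are given as ordered lists of exon identifiers, a
    simplification of real genomic coordinates."""
    nodes, edges = set(), set()
    for exons in transcripts:
        nodes.update(exons)
        # each consecutive exon pair contributes one directed edge
        for a, b in zip(exons, exons[1:]):
            edges.add((a, b))
    return nodes, edges
```

For instance, isoforms ["E1", "E2", "E3"] and ["E1", "E3"] produce the edge ("E1", "E3"), which encodes the exon-skipping event.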
Collapse
|
25
|
HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023] Open
Abstract
Background Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes, thanks to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions Our method leverages the long-range information of sequencing reads to obtain assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method is able to deal with higher coverages to better correct the errors in the original reads and, as a result, to obtain more accurate haplotypes.
Availability HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).
Collapse
|
26
|
Abstract
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to computing the string graph, based on the FM-index and the Burrows-Wheeler transform (BWT). We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.
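The relation a string graph encodes, suffix-prefix overlaps between reads, can be illustrated with a naive quadratic scan; FSG computes these overlaps via the FM-index instead, so the following is only a conceptual sketch with hypothetical names:

```python
def overlap_edges(reads, min_ov):
    """Naive suffix-prefix overlap detection: for every ordered pair of
    reads, find the longest suffix of one that is a prefix of the other,
    keeping overlaps of at least min_ov characters. Quadratic in the
    number of reads, whereas index-based methods avoid the all-pairs scan."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            # try overlap lengths from longest to shortest
            for k in range(min(len(a), len(b)), min_ov - 1, -1):
                if a[-k:] == b[:k]:
                    edges.append((i, j, k))  # read i overlaps read j by k
                    break
    return edges
```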
Collapse
|
27
|
LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly. J Comput Biol 2016; 23:137-49. [PMID: 26953874 DOI: 10.1089/cmb.2015.0172] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited so that the amount of main memory required is independent of the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on an 875-million-read whole-genome dataset, on which LSG built the string graph using only 1 GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time of SGA. The analysis of the entire pipeline shows an important decrease in memory usage, with only a moderate increase in running time.
Collapse
|
28
|
Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Res 2015; 44:D38-47. [PMID: 26538599 PMCID: PMC4702812 DOI: 10.1093/nar/gkv1116] [Citation(s) in RCA: 86] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2015] [Accepted: 10/13/2015] [Indexed: 01/24/2023] Open
Abstract
Life sciences are yielding huge data sets that underpin scientific discoveries fundamental to improvement in human health, agriculture and the environment. In support of these discoveries, a plethora of databases and tools are deployed, in technically complex and diverse implementations, across a spectrum of scientific disciplines. The corpus of documentation of these resources is fragmented across the Web, with much redundancy, and has lacked a common standard of information. The outcome is that scientists must often struggle to find, understand, compare and use the best resources for the task at hand. Here we present a community-driven curation effort, supported by ELIXIR—the European infrastructure for biological information—that aspires to a comprehensive and consistent registry of information about bioinformatics resources. The sustainable upkeep of this Tools and Data Services Registry is assured by a curation effort driven by and tailored to local needs, and shared amongst a network of engaged partners. As of November 2015, the registry includes 1785 resources, with depositions from 126 individual registrations including 52 institutional providers and 74 individuals. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools.
Collapse
|
29
|
Abstract
Alternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes. Aberrant AS is involved in the onset and progression of several human diseases. Therefore, the characterization of the exon-intron structure of a gene and the detection of the corresponding transcript isoforms is an extremely relevant biological task. Nonetheless, the computational prediction of AS events and of the repertoire of alternative transcripts is still a challenging issue. Hereafter we introduce PIntron, a software package to predict the exon-intron structure and the full-length isoforms of a gene given a genomic region and a set of transcripts (ESTs and/or mRNAs). The software is open source and available at http://pintron.algolab.eu. PIntron has been designed for (and extensively tested on) a standard workstation without requiring dedicated expensive hardware. It easily manages large genomic regions and more than 20,000 ESTs, achieving good accuracy as shown in an experimental evaluation performed on 112 well-annotated genes selected from the ENCODE human regions used as training set in the EGASP competition.
Collapse
|
30
|
Abstract
BACKGROUND The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as, for example, haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is, no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite sites assumption is violated because of, e.g., back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is, a character is allowed to return to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred to as PPP) and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree. RESULTS We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters.
CONCLUSIONS A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model makes it possible to efficiently explain data that do not conform to the classical perfect phylogeny model.
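The conflict graph mentioned in the abstract can be sketched with the classical three-gamete test for binary characters; this is a generic illustration with hypothetical names, not the paper's CPPP algorithm:

```python
def conflict_graph_edges(matrix):
    """Edges of the conflict graph over binary characters (columns): two
    characters conflict when some rows (species) exhibit all three
    configurations (1,1), (1,0) and (0,1) -- the classical test for the
    directed perfect phylogeny. An edgeless conflict graph is the
    polynomially solvable case the abstract mentions."""
    m = len(matrix[0])
    edges = []
    for c1 in range(m):
        for c2 in range(c1 + 1, m):
            pairs = {(row[c1], row[c2]) for row in matrix}
            if {(1, 1), (1, 0), (0, 1)} <= pairs:
                edges.append((c1, c2))
    return edges
```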
Collapse
|
31
|
Effectiveness of oral bisphosphonates for primary prevention of osteoporotic fractures. Eur J Clin Pharmacol 2014; 70:1129-37. [DOI: 10.1007/s00228-014-1708-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2014] [Accepted: 06/10/2014] [Indexed: 11/28/2022]
|
32
|
User-only design to assess drug effectiveness in clinical practice: application to bisphosphonates and secondary prevention of fractures. Pharmacoepidemiol Drug Saf 2014; 23:859-67. [PMID: 24911392 DOI: 10.1002/pds.3650] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2014] [Revised: 04/10/2014] [Accepted: 04/28/2014] [Indexed: 02/01/2023]
Abstract
PURPOSE Different strategies applicable to control for confounding by indication in observational studies were compared in a large population-based study of the effect of bisphosphonates (BPs) for secondary prevention of fractures. METHODS The cohort was drawn from healthcare utilization databases of 13 Italian territorial units. Patients aged 55 years or more who were hospitalized for fracture during 2003-2005 entered the cohort. A nested case-control design was used to compare BP use in cohort members who did (cases) and did not (controls) experience a new fracture until 2007 (outcome). Three designs were employed: conventional-matching (D1), propensity score-matching (D2), and user-only (D3) designs. They differed in (i) cohort composition, restricted to patients who received BPs straight after cohort entry (D3); (ii) use of the propensity score for case-control matching (D2); and (iii) the compared groups, BP users versus non-users (D1 and D2) and long-term versus short-term users (D3). RESULTS Bisphosphonate users had odds ratios (95% confidence intervals) of 1.20 (1.01 to 1.44) and 0.95 (0.74 to 1.24) under the D1 and D2 designs, respectively. Statistical evidence that long-term BP use protects against the outcome onset with respect to short-term use was observed for the user-only design (D3), with a corresponding odds ratio (95% confidence interval) of 0.64 (0.44 to 0.93). CONCLUSIONS The user-only design yielded results closer to those seen in RCTs. This approach is one possible strategy to account for confounding by indication.
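As a reminder of the statistic reported above, an odds ratio with a Wald 95% confidence interval can be computed from a 2x2 table as follows. This is an illustrative sketch with made-up counts, not the study's matched conditional analysis:

```python
import math

def odds_ratio_ci(a, b, c, d):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi
```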
Collapse
|
33
|
Oral bisphosphonates do not increase the risk of severe upper gastrointestinal complications: a nested case-control study. BMC Gastroenterol 2014; 14:5. [PMID: 24397769 PMCID: PMC3897893 DOI: 10.1186/1471-230x-14-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/08/2012] [Accepted: 12/18/2013] [Indexed: 11/10/2022] Open
Abstract
Background Data on the effect of oral bisphosphonates (BPs) on the risk of upper gastrointestinal complications (UGIC) are conflicting. We conducted a large population-based study from a network of Italian healthcare utilization databases aimed at assessing the UGIC risk associated with use of BPs in the setting of secondary prevention of osteoporotic fractures. Methods A nested case-control study was carried out within a cohort of 68,970 patients aged 45 years or older who had been hospitalized for osteoporotic fracture from 2003 until 2005. Cases were the 804 patients who experienced hospitalization for UGIC until 2007. Up to 20 controls were randomly selected for each case. A conditional logistic regression model was used to estimate the odds ratio (OR) associated with current and past use of BPs (i.e. drug dispensation within 30 days and over 31 days prior to the outcome onset, respectively) after adjusting for several covariates. Results Compared with patients who did not use BPs, current and past users had ORs (and 95% confidence intervals) of 0.86 (0.60 to 1.22) and 1.07 (0.80 to 1.44), respectively. There was no difference in the ORs estimated according to BP type (alendronate or risedronate) and regimen (daily or weekly), nor with co-therapies and comorbidities. Conclusions This study supplies further evidence that BPs dispensed for secondary prevention of osteoporotic fractures are not associated with an increased risk of severe gastrointestinal complications. Further research is required to clarify the role of BPs and co-medications in inducing UGIC.
Collapse
|
34
|
Risk of severe upper gastrointestinal complications among oral bisphosphonate users. PLoS One 2013; 8:e73159. [PMID: 24348985 PMCID: PMC3857168 DOI: 10.1371/journal.pone.0073159] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2013] [Accepted: 07/17/2013] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Oral bisphosphonates (BPs) are the primary agents for the treatment of osteoporosis. Although BPs are generally well tolerated, serious gastrointestinal adverse events have been observed. AIM To assess the risk of severe upper gastrointestinal complications (UGIC) among BP users by means of a large study based on a network of Italian healthcare utilization databases. METHODS A nested case-control study was carried out by including 110,220 patients aged 45 years or older who, from 2003 until 2005, were treated with oral BPs. Cases were the 862 patients who experienced the outcome (hospitalization for UGIC) until 2007. Up to 20 controls were randomly selected for each case. Conditional logistic regression model was used to estimate odds ratio (OR) associated with current use of BPs after adjusting for several covariates. A set of sensitivity analyses was performed in order to account for sources of systematic uncertainty. RESULTS The adjusted OR for current use of BPs with respect to past use was 0.94 (95% CI 0.81 to 1.08). There was no evidence that this risk changed either with BP type and regimen, or concurrent use of other drugs or previous hospitalizations. CONCLUSIONS No evidence was found that current use of BPs increases the risk of severe upper gastrointestinal complications compared to past use.
Collapse
|
35
|
Abstract
Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main tools developed for NGS data focus on de novo transcriptome assembly (Grabherr et al., 2011; Schulz et al., 2012). While these tools are extensively applied in biological investigations, and the obtained results often reveal their intrinsic shortcomings, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. On the other hand, we still lack methods for computing the gene structures due to AS events under the above assumptions--a problem that we start to tackle with this article. More precisely, based on the notion of isoform graph (Lacroix et al., 2008), we define a compact representation of gene structures--called splicing graph--and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with the input data, and we propose an efficient algorithmic approach to compute this graph.
Collapse
|
36
|
A fast and practical approach to genotype phasing and imputation on a pedigree with erroneous and incomplete information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1582-1594. [PMID: 22848137 DOI: 10.1109/tcbb.2012.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved the efficiency, its applicability to real data sets has been limited since it does not take into account some important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness (on both accuracy and performance) of the algorithm under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveals that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.
Collapse
|
37
|
PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text. BMC Bioinformatics 2012; 13 Suppl 5:S2. [PMID: 22537006 PMCID: PMC3358663 DOI: 10.1186/1471-2105-13-s5-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools not specifically designed for this task. RESULTS We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm with provably small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented in the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. CONCLUSIONS PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under the GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.
Collapse
|
38
|
Improving probe set selection for microbial community analysis by leveraging taxonomic information of training sequences. BMC Bioinformatics 2011; 12:394. [PMID: 21985453 PMCID: PMC3224148 DOI: 10.1186/1471-2105-12-394] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2011] [Accepted: 10/10/2011] [Indexed: 01/17/2023] Open
Abstract
Background Population levels of microbial phylotypes can be examined using a hybridization-based method that utilizes a small set of computationally-designed DNA probes targeted to a gene common to all. Our previous algorithm attempts to select a set of probes such that each training sequence manifests a unique theoretical hybridization pattern (a binary fingerprint) to a probe set. It does so, however, without taking into account similarity between training gene sequences or their putative taxonomic classifications. We present an improved algorithm for probe set selection that utilizes the available taxonomic information of training gene sequences and attempts to choose probes such that the resultant binary fingerprints cluster into real taxonomic groups. Results Gene sequences manifesting identical fingerprints with probes chosen by the new algorithm are more likely to be from the same taxonomic group than are sequences manifesting identical fingerprints with probes chosen by the previous algorithm. In cases where they are from different taxonomic groups, the underlying DNA sequences of identical fingerprints are more similar to each other in probe sets made with the new algorithm than in those made with the previous one. Complete removal of large taxonomic groups from the training data does not greatly decrease the ability of probe sets to distinguish those groups. Conclusions Probe sets made with the new algorithm create fingerprints that more reliably cluster into biologically meaningful groups. The method can readily distinguish microbial phylotypes that were excluded from the training sequences, suggesting that novel microbes can also be detected.
Collapse
|
39
|
Pure parsimony xor haplotyping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:598-610. [PMID: 20498511 DOI: 10.1109/tcbb.2010.52] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. Xor-genotype data is a cheaply obtainable type of data that distinguishes heterozygous from homozygous sites without identifying the homozygous alleles. In this paper, we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given single nucleotide polymorphism (SNP). Finally, we propose a heuristic and present an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.
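The xor-genotype model can be illustrated in a few lines: the xor of two binary haplotypes records only which sites are heterozygous, and distinct haplotype pairs can explain the same xor-genotype, which is why a parsimony criterion is needed. A minimal sketch with hypothetical names:

```python
def xor_genotype(h1, h2):
    """Xor-genotype of two binary haplotypes: 1 marks a heterozygous
    site, 0 a homozygous one; the homozygous allele is not revealed."""
    return [x ^ y for x, y in zip(h1, h2)]

def explains(h1, h2, xg):
    """Check that a haplotype pair resolves a given xor-genotype."""
    return xor_genotype(h1, h2) == xg
```

Note that complementing both haplotypes leaves the xor-genotype unchanged, so the data alone cannot pick between the two explanations.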
Collapse
|
40
|
|
41
|
|
42
|
Experimental analysis of a new algorithm for partial haplotype completion. INTERNATIONAL JOURNAL OF BIOINFORMATICS RESEARCH AND APPLICATIONS 2007; 1:461-73. [PMID: 18048149 DOI: 10.1504/ijbra.2005.008448] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
This paper deals with the computational problem of inferring complete information on haplotypes from haplotypes with missing data. This problem is one of the main issues in haplotyping, as the current DNA sequencing technology often produces haplotypes with missing bases and therefore the complete information on haplotypes has to be inferred through computational methods. In this paper, we propose a new algorithmic approach to the problem that assumes both the Coalescent and the Minimum Entropy models and we provide an experimental analysis relating it to the previously investigated approaches. In particular, the reconstruction of a perfect phylogeny from haplotypes with missing data is addressed.
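For contrast with the paper's model-based approach, a naive baseline for completing missing haplotype data is a column-wise majority fill. This sketch (with hypothetical names) deliberately ignores the Coalescent and Minimum Entropy constraints the paper enforces:

```python
def majority_complete(haplotypes, missing="?"):
    """Naive completion of haplotypes with missing values: each missing
    entry is replaced by the most frequent known allele in its column.
    A baseline illustration only -- model-based methods additionally
    require the completed haplotypes to fit a perfect phylogeny or to
    minimize entropy."""
    cols = len(haplotypes[0])
    filled = [list(h) for h in haplotypes]
    for j in range(cols):
        known = [h[j] for h in haplotypes if h[j] != missing]
        majority = max(set(known), key=known.count) if known else "0"
        for row in filled:
            if row[j] == missing:
                row[j] = majority
    return ["".join(row) for row in filled]
```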
Collapse
|
43
|
Exemplar longest common subsequence. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:535-543. [PMID: 17975265 DOI: 10.1109/tcbb.2007.1066] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
In this paper, we investigate the computational and approximation complexity of the Exemplar Longest Common Subsequence of a set of sequences (ELCS problem), a generalization of the Longest Common Subsequence problem, where the input sequences are over the union of two disjoint sets of symbols, a set of mandatory symbols and a set of optional symbols. We show that different versions of the problem are APX-hard even for instances with two sequences. Moreover, we show that the related problem of determining the existence of a feasible solution of the Exemplar Longest Common Subsequence of two sequences is NP-hard. On the positive side, we first present an efficient algorithm for the ELCS problem over instances of two sequences where each mandatory symbol can appear in total at most three times in the sequences. Furthermore, we present two fixed-parameter algorithms for the ELCS problem over instances of two sequences where the parameter is the number of mandatory symbols.
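The base problem that ELCS generalizes, together with the easy verification step (as opposed to the NP-hard feasibility search), can be sketched as follows. Names are hypothetical and this is not the paper's algorithm:

```python
def lcs_length(s1, s2):
    """Classic longest common subsequence dynamic program -- the base
    problem that ELCS extends with mandatory and optional symbols."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def is_subsequence(t, s):
    # consuming the iterator enforces left-to-right matching
    it = iter(s)
    return all(ch in it for ch in t)

def is_feasible_elcs(candidate, s1, s2, mandatory):
    """Verify an ELCS candidate: a common subsequence of s1 and s2 that
    contains every mandatory symbol. Verification is easy even though,
    as the abstract notes, deciding feasibility is NP-hard."""
    return (is_subsequence(candidate, s1)
            and is_subsequence(candidate, s2)
            and set(mandatory) <= set(candidate))
```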
Collapse
|
44
|
A library of efficient bioinformatics algorithms. APPLIED BIOINFORMATICS 2003; 2:117-21. [PMID: 15130828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 04/29/2023]
Abstract
In this paper we review some of the existing projects in the bioinformatics field that facilitate the development of programs, but for which minimising the running time is not of primary importance. We point out the advantages of open source libraries for such tasks and we discuss some of the available open source licenses. Finally, we present the ALiBio project, which aims to facilitate the development of efficient programs in bioinformatics.
|
45
|
Oligonucleotide fingerprinting of rRNA genes for analysis of fungal community composition. Appl Environ Microbiol 2002; 68:5999-6004. [PMID: 12450821 PMCID: PMC134423 DOI: 10.1128/aem.68.12.5999-6004.2002]
Abstract
Thorough assessments of fungal diversity are currently hindered by technological limitations. Here we describe a new method for identifying fungi, oligonucleotide fingerprinting of rRNA genes (OFRG). OFRG sorts arrayed rRNA gene (ribosomal DNA [rDNA]) clones into taxonomic clusters through a series of hybridization experiments, each using a single oligonucleotide probe. A simulated annealing algorithm was used to design an OFRG probe set for fungal rDNA. Analysis of 1,536 fungal rDNA clones derived from soil generated 455 clusters. A pairwise sequence analysis showed that clones with average sequence identities of 99.2% were grouped into the same cluster. To examine the accuracy of the taxonomic identities produced by this OFRG experiment, we determined the nucleotide sequences for 117 clones distributed throughout the tree. For all but two of these clones, the taxonomic identities generated by this OFRG experiment were consistent with those generated by a nucleotide sequence analysis. Eighty-eight percent of the clones were affiliated with Ascomycota, while 12% belonged to Basidiomycota. A large fraction of the clones were affiliated with the genera Fusarium (404 clones) and Raciborskiomyces (176 clones). Smaller assemblages of clones had high sequence identities to the Alternaria, Ascobolus, Chaetomium, Cryptococcus, and Rhizoctonia clades.
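The core OFRG idea, each clone reduced to a binary vector of probe hybridization signals and clones with matching vectors grouped into a cluster, can be shown with a minimal sketch. The clone identifiers and 0/1 signals below are invented for illustration; the actual method uses a simulated annealing-designed probe set and must tolerate hybridization noise, which this exact-match grouping does not:

```python
from collections import defaultdict

def cluster_by_fingerprint(fingerprints):
    """Group clones whose probe hybridization fingerprints are identical.

    fingerprints: dict mapping clone id -> tuple of 0/1 probe signals,
    one entry per probe in the (fixed) probe set.
    Returns a list of clusters, each a list of clone ids.
    """
    clusters = defaultdict(list)
    for clone, signals in fingerprints.items():
        clusters[signals].append(clone)  # tuples are hashable keys
    return list(clusters.values())

# Hypothetical 3-probe fingerprints for three arrayed clones.
fps = {"clone1": (1, 0, 1), "clone2": (1, 0, 1), "clone3": (0, 1, 1)}
groups = cluster_by_fingerprint(fps)  # clone1 and clone2 share a cluster
```

With more probes the fingerprints resolve finer taxonomic groups, which is why probe-set design (here, by simulated annealing) drives the resolution the abstract reports.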
|
46
|
Abstract
MOTIVATION Reconstructing evolutionary trees is an important problem in biology. A response to the computational intractability of most of the traditional criteria for inferring evolutionary trees has been a focus on new criteria, particularly quartet-based methods that seek to merge trees derived on subsets of four species from a given species-set into a tree for that entire set. Unfortunately, most of these methods are very sensitive to errors in the reconstruction of the trees for individual quartets of species. A recently developed technique called quartet cleaning can alleviate this difficulty in certain cases by using redundant information in the complete set of quartet topologies for a given species-set to correct such errors. RESULTS In this paper, we describe two new local vertex quartet cleaning algorithms which have optimal time complexity and error-correction bound, respectively. These are the first known local vertex quartet cleaning algorithms that are optimal with respect to either of these attributes.
|
47
|
Analysis of bacterial community composition by oligonucleotide fingerprinting of rRNA genes. Appl Environ Microbiol 2002; 68:3243-50. [PMID: 12089000 PMCID: PMC126790 DOI: 10.1128/aem.68.7.3243-3250.2002]
Abstract
One of the first steps in characterizing an ecosystem is to describe the organisms inhabiting it. For microbial studies, experimental limitations have hindered the ability to depict diverse communities. Here we describe oligonucleotide fingerprinting of rRNA genes (OFRG), a method that permits identification of arrayed rRNA genes (rDNA) through a series of hybridization experiments using small DNA probes. To demonstrate this strategy, we examined the bacteria inhabiting two different soils. Analysis of 1,536 rDNA clones revealed 766 clusters grouped into five major taxa: Bacillus, Actinobacteria, Proteobacteria, and two undefined assemblages. Soil-specific taxa were identified and then independently confirmed through cluster-specific PCR of the original soil DNA. Near-species-level resolution was obtained by this analysis as clones with average sequence identities of 97% were grouped in the same cluster. A comparison of these OFRG results with the results obtained in a denaturing gradient gel electrophoresis analysis of the same two soils demonstrated the significance of this methodological advance. OFRG provides a cost-effective means to extensively analyze microbial communities and should have applications in medicine, biotechnology, and ecosystem studies.
|
48
|
|