1
|
Yao HT, Marchand B, Berkemer SJ, Ponty Y, Will S. Infrared: a declarative tree decomposition-powered framework for bioinformatics. Algorithms Mol Biol 2024; 19:13. [PMID: 38493130 PMCID: PMC10943887 DOI: 10.1186/s13015-024-00258-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 02/13/2024] [Indexed: 03/18/2024] Open
Abstract
MOTIVATION Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations. METHODS We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency. RESULTS Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on a tree decomposition-based efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework-together with our novel results-underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to the ones of specialized algorithms and implementations. AVAILABILITY Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
Collapse
Affiliation(s)
- Hua-Ting Yao
- LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France.
- Department of Theoretical Chemistry, University of Vienna, Vienna, Austria.
- School of Computer Science, McGill University, Montreal, Canada.
| | - Bertrand Marchand
- LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
| | - Sarah J Berkemer
- LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
| | - Yann Ponty
- LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
| | - Sebastian Will
- LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France.
| |
Collapse
|
2
|
Huang J, Voß B. Simulation of Folding Kinetics for Aligned RNAs. Genes (Basel) 2021; 12:genes12030347. [PMID: 33652983 PMCID: PMC7996734 DOI: 10.3390/genes12030347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2021] [Revised: 02/18/2021] [Accepted: 02/22/2021] [Indexed: 11/16/2022] Open
Abstract
Studying the folding kinetics of an RNA can provide insight into its function and is thus a valuable method for RNA analyses. Computational approaches to the simulation of folding kinetics suffer from the exponentially large folding space that needs to be evaluated. Here, we present a new approach that combines structure abstraction with evolutionary conservation to restrict the analysis to common parts of folding spaces of related RNAs. The resulting algorithm can recapitulate the folding kinetics known for single RNAs and is able to analyse even long RNAs in reasonable time. Our program RNAliHiKinetics is the first algorithm for the simulation of consensus folding kinetics and addresses a long-standing problem in a new and unique way.
Collapse
Affiliation(s)
- Jiabin Huang
- Institute of Medical Microbiology, Virology and Hygiene, University Medical Center Hamburg-Eppendorf, Martinistraße 52, 20246 Hamburg, Germany;
| | - Björn Voß
- Computational Biology Group, Institute of Biochemical Engineering, University of Stuttgart, Allmandring 31, 70569 Stuttgart, Germany
- Correspondence:
| |
Collapse
|
3
|
AbouEisha H, Amin T, Chikalov I, Hussain S, Moshkov M. Introduction. EXTENSIONS OF DYNAMIC PROGRAMMING FOR COMBINATORIAL OPTIMIZATION AND DATA MINING 2019:1-12. [DOI: 10.1007/978-3-319-91839-6_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
|
4
|
|
5
|
Abstract
RNA family models describe classes of functionally related, non-coding RNAs based on sequence and structure conservation. The most important method for modeling RNA families is the use of covariance models, which are stochastic models that serve in the discovery of yet unknown, homologous RNAs. However, the performance of covariance models in finding remote homologs is poor for RNA families with high sequence conservation, while for families with high structure but low sequence conservation, these models are difficult to built in the first place. A complementary approach to RNA family modeling involves the use of thermodynamic matchers. Thermodynamic matchers are RNA folding programs, based on the established thermodynamic model, but tailored to a specific structural motif. As thermodynamic matchers focus on structure and folding energy, they unfold their potential in discovering homologs, when high structure conservation is paired with low sequence conservation. In contrast to covariance models, construction of thermodynamic matchers does not require an input alignment, but requires human design decisions and experimentation, and hence, model construction is more laborious. Here we report a case study on an RNA family that was constructed by means of thermodynamic matchers. It starts from a set of known but structurally different members of the same RNA family. The consensus secondary structure of this family consists of 2 to 4 adjacent hairpins. Each hairpin loop carries the same motif, CCUCCUCCC, while the stems show high variability in their nucleotide content. The present study describes (1) a novel approach for the integration of the structurally varying family into a single RNA family model by means of the thermodynamic matcher methodology, and (2) provides the results of homology searches that were conducted with this model in a wide spectrum of bacterial species.
Collapse
Key Words
- CIN, conserved intergenic neighborhood
- CM, covariance model
- HMM, hidden Markov model
- MFE, minimum free energy
- OG, orthologous group of genes
- RBS, ribosome binding site
- RFM, RNA family model
- TDM, thermodynamic matcher
- aSD, anti Shine-Dalgarno
- alphaproteobacteria
- cuckoo RNA
- dRNA-seq, differential RNA sequencing
- family model
- homology search
- sRNA, small non-coding RNA
- small RNA
- structural RNA
- thermodynamic matcher
Collapse
Affiliation(s)
- Jan Reinkensmeier
- a Universität Bielefeld ; Technische Fakultät and Center of Biotechnology ; Bielefeld , Germany
| | | |
Collapse
|
6
|
Saule C, Giegerich R. Pareto optimization in algebraic dynamic programming. Algorithms Mol Biol 2015; 10:22. [PMID: 26150892 PMCID: PMC4491898 DOI: 10.1186/s13015-015-0051-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2014] [Accepted: 05/07/2015] [Indexed: 11/10/2022] Open
Abstract
Pareto optimization combines independent objectives by computing the Pareto front of its search space, defined as the set of all solutions for which no other candidate solution scores better under all objectives. This gives, in a precise sense, better information than an artificial amalgamation of different scores into a single objective, but is more costly to compute. Pareto optimization naturally occurs with genetic algorithms, albeit in a heuristic fashion. Non-heuristic Pareto optimization so far has been used only with a few applications in bioinformatics. We study exact Pareto optimization for two objectives in a dynamic programming framework. We define a binary Pareto product operator [Formula: see text] on arbitrary scoring schemes. Independent of a particular algorithm, we prove that for two scoring schemes A and B used in dynamic programming, the scoring scheme [Formula: see text] correctly performs Pareto optimization over the same search space. We study different implementations of the Pareto operator with respect to their asymptotic and empirical efficiency. Without artificial amalgamation of objectives, and with no heuristics involved, Pareto optimization is faster than computing the same number of answers separately for each objective. For RNA structure prediction under the minimum free energy versus the maximum expected accuracy model, we show that the empirical size of the Pareto front remains within reasonable bounds. Pareto optimization lends itself to the comparative investigation of the behavior of two alternative scoring schemes for the same purpose. For the above scoring schemes, we observe that the Pareto front can be seen as a composition of a few macrostates, each consisting of several microstates that differ in the same limited way. We also study the relationship between abstract shape analysis and the Pareto front, and find that they extract information of a different nature from the folding space and can be meaningfully combined.
Collapse
|
7
|
Janssen S, Giegerich R. Ambivalent covariance models. BMC Bioinformatics 2015; 16:178. [PMID: 26017195 PMCID: PMC4504443 DOI: 10.1186/s12859-015-0569-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2014] [Accepted: 04/10/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Evolutionary variations let us define a set of similar nucleic acid sequences as a family if these different molecules execute a common function. Capturing their sequence variation by using e. g. position specific scoring matrices significantly improves sensitivity of detection tools. Members of a functional (non-coding) RNA family are affected by these variations not only on the sequence, but also on the structural level. For example, some transfer-RNAs exhibit a fifth helix in addition to the typical cloverleaf structure. Current covariance models - the unrivaled homology search approach for structured RNA - do not benefit from structural variation within a family, but rather penalize it. This leads to artificial subdivision of families and loss of information in the RFAM database. RESULTS We propose an extension to the fundamental architecture of covariance models to allow for several, compatible consensus structures. The resulting models are called ambivalent covariance models. Evaluation on several RFAM families shows that coalescence of structural variation within a family by using ambivalent consensus models is superior to subdividing the family into multiple classical covariance models. CONCLUSION A prototype and source code is available at http://bibiserv.cebitec.uni-bielefeld.de/acms.
Collapse
Affiliation(s)
- Stefan Janssen
- Practical Computer Science, Faculty of Technology, Bielefeld University, Universitätsstraße 25, Bielefeld, 33615, Germany.
| | - Robert Giegerich
- Practical Computer Science, Faculty of Technology, Bielefeld University, Universitätsstraße 25, Bielefeld, 33615, Germany.
| |
Collapse
|
8
|
Abstract
MOTIVATION Abstract shape analysis, first proposed in 2004, allows one to extract several relevant structures from the folding space of an RNA sequence, preferable to focusing in a single structure of minimal free energy. We report recent extensions to this approach. RESULTS We have rebuilt the original RNAshapes as a repository of components that allows us to integrate several established tools for RNA structure analysis: RNAshapes, RNAalishapes and pknotsRG, including its recent extension pKiss. As a spin-off, we obtain heretofore unavailable functionality: e. g. with pKiss, we can now perform abstract shape analysis for structures holding pseudoknots up to the complexity of kissing hairpin motifs. The new tool pAliKiss can predict kissing hairpin motifs from aligned sequences. Along with the integration, the functionality of the tools was also extended in manifold ways. AVAILABILITY AND IMPLEMENTATION As before, the tool is available on the Bielefeld Bioinformatics server at http://bibiserv.cebitec.uni-bielefeld.de/rnashapesstudio. CONTACT bibi-help@cebitec.uni-bielefeld.de.
Collapse
Affiliation(s)
- Stefan Janssen
- Practical Computer Science, Faculty of Technology, Bielefeld University, D-33615 Bielefeld, Germany
| | - Robert Giegerich
- Practical Computer Science, Faculty of Technology, Bielefeld University, D-33615 Bielefeld, Germany
| |
Collapse
|
9
|
Becker A, Overlöper A, Schlüter JP, Reinkensmeier J, Robledo M, Giegerich R, Narberhaus F, Evguenieva-Hackenberg E. Riboregulation in plant-associated α-proteobacteria. RNA Biol 2014; 11:550-62. [PMID: 25003187 DOI: 10.4161/rna.29625] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023] Open
Abstract
The symbiotic α-rhizobia Sinorhizobium meliloti, Bradyrhizobium japonicum, Rhizobium etli and the related plant pathogen Agrobacterium tumefaciens are important model organisms for studying plant-microbe interactions. These metabolically versatile soil bacteria are characterized by complex lifestyles and large genomes. Here we summarize the recent knowledge on their small non-coding RNAs (sRNAs) including conservation, function, and interaction of the sRNAs with the RNA chaperone Hfq. In each of these organisms, an inventory of hundreds of cis- and trans-encoded sRNAs with regulatory potential was uncovered by high-throughput approaches and used for the construction of 39 sRNA family models. Genome-wide analyses of hfq mutants and co-immunoprecipitation with tagged Hfq revealed a major impact of the RNA chaperone on the physiology of plant-associated α-proteobacteria including symbiosis and virulence. Highly conserved members of the SmelC411 family are the AbcR sRNAs, which predominantly regulate ABC transport systems. AbcR1 of A. tumefaciens controls the uptake of the plant-generated signaling molecule GABA and is a central regulator of nutrient uptake systems. It has similar functions in S. meliloti and the human pathogen Brucella abortus. As RNA degradation is an important process in RNA-based gene regulation, a short overview on ribonucleases in plant-associated α-proteobacteria concludes this review.
Collapse
Affiliation(s)
- Anke Becker
- LOEWE Centre for Synthetic Microbiology and Faculty of Biology; Philipps-Universität Marburg; Marburg, Germany
| | | | - Jan-Philip Schlüter
- LOEWE Centre for Synthetic Microbiology and Faculty of Biology; Philipps-Universität Marburg; Marburg, Germany
| | - Jan Reinkensmeier
- Center for Biotechnology (CeBiTec); Bielefeld University; Bielefeld, Germany
| | - Marta Robledo
- LOEWE Centre for Synthetic Microbiology and Faculty of Biology; Philipps-Universität Marburg; Marburg, Germany
| | - Robert Giegerich
- Center for Biotechnology (CeBiTec); Bielefeld University; Bielefeld, Germany
| | | | | |
Collapse
|
10
|
Modeling Dynamic Programming Problems over Sequences and Trees with Inverse Coupled Rewrite Systems. ALGORITHMS 2014. [DOI: 10.3390/a7010062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
11
|
Huang J, Voß B. Analysing RNA-kinetics based on folding space abstraction. BMC Bioinformatics 2014; 15:60. [PMID: 24575751 PMCID: PMC3974018 DOI: 10.1186/1471-2105-15-60] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2013] [Accepted: 02/24/2014] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND RNA molecules, especially non-coding RNAs, play vital roles in the cell and their biological functions are mostly determined by structural properties. Often, these properties are related to dynamic changes in the structure, as in the case of riboswitches, and thus the analysis of RNA folding kinetics is crucial for their study. Exact approaches to kinetic folding are computationally expensive and, thus, limited to short sequences. In a previous study, we introduced a position-specific abstraction based on helices which we termed helix index shapes (hishapes) and a hishape-based algorithm for near-optimal folding pathway computation, called HiPath. The combination of these approaches provides an abstract view of the folding space that offers information about the global features. RESULTS In this paper we present HiKinetics, an algorithm that can predict RNA folding kinetics for sequences up to several hundred nucleotides long. This algorithm is based on RNAHeliCes, which decomposes the folding space into abstract classes, namely hishapes, and an improved version of HiPath, namely HiPath2, which estimates plausible folding pathways that connect these classes. Furthermore, we analyse the relationship of hishapes to locally optimal structures, the results of which strengthen the use of the hishape abstraction for studying folding kinetics. Finally, we show the application of HiKinetics to the folding kinetics of two well-studied RNAs. CONCLUSIONS HiKinetics can calculate kinetic folding based on a novel hishape decomposition. HiKinetics, together with HiPath2 and RNAHeliCes, is available for download at http://www.cyanolab.de/software/RNAHeliCes.htm.
Collapse
Affiliation(s)
- Jiabin Huang
- Genetics & Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, 79104, Freiburg, Germany
| | - Björn Voß
- Genetics & Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, 79104, Freiburg, Germany
| |
Collapse
|
12
|
Rivas E. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biol 2013; 10:1185-96. [PMID: 23695796 PMCID: PMC3849167 DOI: 10.4161/rna.24971] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2013] [Revised: 05/06/2013] [Accepted: 05/08/2013] [Indexed: 12/31/2022] Open
Abstract
Any method for RNA secondary structure prediction is determined by four ingredients. The architecture is the choice of features implemented by the model (such as stacked basepairs, loop length distributions, etc.). The architecture determines the number of parameters in the model. The scoring scheme is the nature of those parameters (whether thermodynamic, probabilistic, or weights). The parameterization stands for the specific values assigned to the parameters. These three ingredients are referred to as "the model." The fourth ingredient is the folding algorithms used to predict plausible secondary structures given the model and the sequence of a structural RNA. Here, I make several unifying observations drawn from looking at more than 40 years of methods for RNA secondary structure prediction in the light of this classification. As a final observation, there seems to be a performance ceiling that affects all methods with complex architectures, a ceiling that impacts all scoring schemes with remarkable similarity. This suggests that modeling RNA secondary structure by using intrinsic sequence-based plausible "foldability" will require the incorporation of other forms of information in order to constrain the folding space and to improve prediction accuracy. This could give an advantage to probabilistic scoring systems since a probabilistic framework is a natural platform to incorporate different sources of information into one single inference problem.
Collapse
Affiliation(s)
- Elena Rivas
- Janelia Farm Research Campus; Howard Hughes Medical Institute; Ashburn, VA USA
| |
Collapse
|