1
|
Pietrosanto M, Mattei E, Helmer-Citterich M, Ferrè F. A novel method for the identification of conserved structural patterns in RNA: From small scale to high-throughput applications. Nucleic Acids Res 2016; 44:8600-8609. [PMID: 27580722 PMCID: PMC5062999 DOI: 10.1093/nar/gkw750] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2016] [Accepted: 08/17/2016] [Indexed: 12/21/2022] Open
Abstract
Functional RNA regions are often related to recurrent secondary structure patterns (or motifs), which can exert their role in several different ways, particularly in dictating the interaction with RNA-binding proteins, and acting in the regulation of a large number of cellular processes. Among the available motif-finding tools, the majority focuses on sequence patterns, sometimes including secondary structure as additional constraints to improve their performance. Nonetheless, secondary structures motifs may be concurrent to their sequence counterparts or even encode a stronger functional signal. Current methods for searching structural motifs generally require long pipelines and/or high computational efforts or previously aligned sequences. Here, we present BEAM (BEAr Motif finder), a novel method for structural motif discovery from a set of unaligned RNAs, taking advantage of a recently developed encoding for RNA secondary structure named BEAR (Brand nEw Alphabet for RNAs) and of evolutionary substitution rates of secondary structure elements. Tested in a varied set of scenarios, from small- to large-scale, BEAM is successful in retrieving structural motifs even in highly noisy data sets, such as those that can arise in CLIP-Seq or other high-throughput experiments.
Collapse
Affiliation(s)
- Marco Pietrosanto
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, 00133 Rome, Italy
| | - Eugenio Mattei
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, 00133 Rome, Italy
| | - Manuela Helmer-Citterich
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, 00133 Rome, Italy
| | - Fabrizio Ferrè
- Department of Pharmacy and Biotechnology (FaBiT), University of Bologna Alma Mater, Via Belmeloro 8/2, 40126 Bologna, Italy
| |
Collapse
|
2
|
Gawronski AR, Turcotte M. RiboFSM: frequent subgraph mining for the discovery of RNA structures and interactions. BMC Bioinformatics 2014; 15 Suppl 13:S2. [PMID: 25434643 PMCID: PMC4248650 DOI: 10.1186/1471-2105-15-s13-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Frequent subgraph mining is a useful method for extracting meaningful patterns from a set of graphs or a single large graph. Here, the graph represents all possible RNA structures and interactions. Patterns that are significantly more frequent in this graph over a random graph are extracted. We hypothesize that these patterns are most likely to represent biological mechanisms. The graph representation used is a directed dual graph, extended to handle intermolecular interactions. The graph is sampled for subgraphs, which are labeled using a canonical labeling method and counted. The resulting patterns are compared to those created from a randomized dataset and scored. The algorithm was applied to the mitochondrial genome of the kinetoplastid species Trypanosoma brucei, which has a unique RNA editing mechanism. The most significant patterns contain two stem-loops, indicative of gRNA, and represent interactions of these structures with target mRNA.
Collapse
|
3
|
Badr G, Al-Turaiki I, Turcotte M, Mathkour H. IncMD: incremental trie-based structural motif discovery algorithm. J Bioinform Comput Biol 2014; 12:1450027. [PMID: 25362841 DOI: 10.1142/s0219720014500279] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The discovery of common RNA secondary structure motifs is an important problem in bioinformatics. The presence of such motifs is usually associated with key biological functions. However, the identification of structural motifs is far from easy. Unlike motifs in sequences, which have conserved bases, structural motifs have common structure arrangements even if the underlying sequences are different. Over the past few years, hundreds of algorithms have been published for the discovery of sequential motifs, while less work has been done for the structural motifs case. Current structural motif discovery algorithms are limited in terms of accuracy and scalability. In this paper, we present an incremental and scalable algorithm for discovering RNA secondary structure motifs, namely IncMD. We consider the structural motif discovery as a frequent pattern mining problem and tackle it using a modified a priori algorithm. IncMD uses data structures, trie-based linked lists of prefixes (LLP), to accelerate the search and retrieval of patterns, support counting, and candidate generation. We modify the candidate generation step in order to adapt it to the RNA secondary structure representation. IncMD constructs the frequent patterns incrementally from RNA secondary structure basic elements, using nesting and joining operations. The notion of a motif group is introduced in order to simulate an alignment of motifs that only differ in the number of unpaired bases. In addition, we use a cluster beam approach to select motifs that will survive to the next iterations of the search. Results indicate that IncMD can perform better than some of the available structural motif discovery algorithms in terms of sensitivity (Sn), positive predictive value (PPV), and specificity (Sp). The empirical results also show that the algorithm is scalable and runs faster than all of the compared algorithms.
Collapse
Affiliation(s)
- Ghada Badr
- College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia , IRI - The City of Scientific Research and Technological Applications, University and Research District, P. O. 21934, New Borg Alarab, Alexandria, Egypt
| | | | | | | |
Collapse
|
4
|
Badr G, Al-Turaiki I, Mathkour H. Classification and assessment tools for structural motif discovery algorithms. BMC Bioinformatics 2013; 14 Suppl 9:S4. [PMID: 23902564 PMCID: PMC3698030 DOI: 10.1186/1471-2105-14-s9-s4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. METHODS In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. RESULTS Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. CONCLUSIONS We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.
Collapse
Affiliation(s)
- Ghada Badr
- King Saud University, College of Computer and Information Sciences, Riyadh, Kingdom of Saudi Arabia
- IRI - The City of Scientific Research and Technological Applications, University and Research District, P.O. 21934 New Borg Alarab, Alexandria, Egypt
| | - Isra Al-Turaiki
- King Saud University, College of Computer and Information Sciences, Riyadh, Kingdom of Saudi Arabia
| | - Hassan Mathkour
- King Saud University, College of Computer and Information Sciences, Riyadh, Kingdom of Saudi Arabia
| |
Collapse
|
5
|
Vangone A, Oliva R, Cavallo L. CONS-COCOMAPS: a novel tool to measure and visualize the conservation of inter-residue contacts in multiple docking solutions. BMC Bioinformatics 2012; 13 Suppl 4:S19. [PMID: 22536965 PMCID: PMC3434444 DOI: 10.1186/1471-2105-13-s4-s19] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Background The development of accurate protein-protein docking programs is making this kind of simulations an effective tool to predict the 3D structure and the surface of interaction between the molecular partners in macromolecular complexes. However, correctly scoring multiple docking solutions is still an open problem. As a consequence, the accurate and tedious screening of many docking models is usually required in the analysis step. Methods All the programs under CONS-COCOMAPS have been written in python, taking advantage of python libraries such as SciPy and Matplotlib. CONS-COCOMAPS is freely available as a web tool at the URL: http://www.molnac.unisa.it/BioTools/conscocomaps/. Results Here we presented CONS-COCOMAPS, a novel tool to easily measure and visualize the consensus in multiple docking solutions. CONS-COCOMAPS uses the conservation of inter-residue contacts as an estimate of the similarity between different docking solutions. To visualize the conservation, CONS-COCOMAPS uses intermolecular contact maps. Conclusions The application of CONS-COCOMAPS to test-cases taken from recent CAPRI rounds has shown that it is very efficient in highlighting even a very weak consensus that often is biologically meaningful.
Collapse
Affiliation(s)
- Anna Vangone
- Department of Applied Sciences, University Parthenope of Naples, Centro Direzionale Isola C4, Naples, 80143, Italy
| | | | | |
Collapse
|
6
|
Sahraeian SME, Yoon BJ. PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach. BMC Bioinformatics 2011; 12 Suppl 1:S38. [PMID: 21342569 PMCID: PMC3044294 DOI: 10.1186/1471-2105-12-s1-s38] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
Background Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has grasped more and more attentions as recent studies unveiled the significance of ncRNAs in living organisms. While the Sankoff style structural alignment algorithms cannot efficiently serve for multiple sequences, mostly progressive schemes are used to reduce the complexity. However, this idea tends to propagate the early stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA which constructs an accurate alignment in a non-progressive fashion. Results Here, we propose PicXAA-R as an extension to PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently grasps both folding information within each sequence and local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. Conclusions Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
Collapse
|
7
|
Abstract
Like protein coding sequences, functional motifs in RNA elements are frequently conserved, but this conservation is most often at the structure level rather than sequence based. Proper characterization of these structural RNA motifs is both the key and the limiting step to understanding the nature of RNA-protein interactions. The discovery of elements targeted by RNA-binding proteins and how they function remains one of the most active, yet elusive areas of RNA biology. Only a limited number of these elements have been well characterized with many of the fundamental rules yet to be discovered. Here we present a comprehensive list of web based resources that can be used in the study and identification of RNA-based structural and regulatory motifs and provide a survey of the informatic resources that can have been developed to facilitate this research.
Collapse
Affiliation(s)
- Ajish D George
- Department of Biomedical Sciences, School of Public Health, Gen∗NY∗Sis Center for Excellence in Cancer Genomics, University at Albany-SUNY, Rensselaer, NY, USA.
| | | |
Collapse
|
8
|
Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics 2009; 25:3236-43. [DOI: 10.1093/bioinformatics/btp580] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
|
9
|
Taufer M, Leung MY, Solorio T, Licon A, Mireles D, Araiza R, Johnson KL. RNAVLab: A virtual laboratory for studying RNA secondary structures based on grid computing technology. PARALLEL COMPUTING 2008; 34:661-680. [PMID: 19885376 PMCID: PMC2714649 DOI: 10.1016/j.parco.2008.08.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/05/2007] [Revised: 06/06/2008] [Accepted: 08/21/2008] [Indexed: 05/28/2023]
Abstract
As ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation, their secondary structures have been the focus of many recent studies. Despite the computing power of supercomputers, computationally predicting secondary structures with thermodynamic methods is still not feasible when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots. This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures including pseudoknots that allows scientists to address this challenge. Two important case studies show the versatility and functionalities of RNAVLab. The first study quantifies its capability to rebuild longer secondary structures from motifs found in systematically sampled nucleotide segments. The extensive sampling and predictions are made feasible in a short turnaround time because of the grid technology used. The second study shows how RNAVLab allows scientists to study the viral RNA genome replication mechanisms used by members of the virus family Nodaviridae.
Collapse
Affiliation(s)
- Michela Taufer
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, United States
| | - Ming-Ying Leung
- Department of Mathematical Sciences, The University of Texas at El Paso, El Paso, TX 79968, United States
- Bioinformatics Program, The University of Texas at El Paso, El Paso, TX 79968, United States
- Border Biomedical Research Center, The University of Texas at El Paso, El Paso, TX 79968, United States
| | - Thamar Solorio
- Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, United States
| | - Abel Licon
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, United States
| | - David Mireles
- Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968, United States
| | - Roberto Araiza
- Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968, United States
| | - Kyle L. Johnson
- Border Biomedical Research Center, The University of Texas at El Paso, El Paso, TX 79968, United States
- Department of Biological Sciences, The University of Texas at El Paso, El Paso, TX 79968, United States
| |
Collapse
|
10
|
Informatic resources for identifying and annotating structural RNA motifs. Mol Biotechnol 2008; 41:180-93. [PMID: 18979204 DOI: 10.1007/s12033-008-9114-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2008] [Accepted: 10/01/2008] [Indexed: 10/21/2022]
Abstract
Post-transcriptional regulation of genes and transcripts is a vital aspect of cellular processes, and unlike transcriptional regulation, remains a largely unexplored domain. One of the most obvious and most important questions to explore is the discovery of functional RNA elements. Many RNA elements have been characterized to date ranging from cis-regulatory motifs within mRNAs to large families of non-coding RNAs. Like protein coding genes, the functional motifs of these RNA elements are highly conserved, but unlike protein coding genes, it is most often the structure and not the sequence that is conserved. Proper characterization of these structural RNA motifs is both the key and the limiting step to understanding the post-transcriptional aspects of the genomic world. Here, we focus on the task of structural motif discovery and provide a survey of the informatics resources geared towards this task.
Collapse
|
11
|
Seemann SE, Gorodkin J, Backofen R. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res 2008; 36:6355-62. [PMID: 18836192 PMCID: PMC2582601 DOI: 10.1093/nar/gkn544] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Computational methods for determining the secondary structure of RNA sequences from given alignments are currently either based on thermodynamic folding, compensatory base pair substitutions or both. However, there is currently no approach that combines both sources of information in a single optimization problem. Here, we present a model that formally integrates both the energy-based and evolution-based approaches to predict the folding of multiple aligned RNA sequences. We have implemented an extended version of Pfold that identifies base pairs that have high probabilities of being conserved and of being energetically favorable. The consensus structure is predicted using a maximum expected accuracy scoring scheme to smoothen the effect of incorrectly predicted base pairs. Parameter tuning revealed that the probability of base pairing has a higher impact on the RNA structure prediction than the corresponding probability of being single stranded. Furthermore, we found that structurally conserved RNA motifs are mostly supported by folding energies. Other problems (e.g. RNA-folding kinetics) may also benefit from employing the principles of the model we introduce. Our implementation, PETfold, was tested on a set of 46 well-curated Rfam families and its performance compared favorably to that of Pfold and RNAalifold.
Collapse
Affiliation(s)
- Stefan E Seemann
- Division of Genetics and Bioinformatics, IBHV and Center for Applied Bioinformatics, University of Copenhagen, Groennegårdsvej 3, DK-1870 Frederiksberg C, Denmark
| | | | | |
Collapse
|
12
|
Can Clustal-style progressive pairwise alignment of multiple sequences be used in RNA secondary structure prediction? BMC Bioinformatics 2007; 8:190. [PMID: 17559658 PMCID: PMC1904245 DOI: 10.1186/1471-2105-8-190] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2006] [Accepted: 06/08/2007] [Indexed: 11/15/2022] Open
Abstract
Background In ribonucleic acid (RNA) molecules whose function depends on their final, folded three-dimensional shape (such as those in ribosomes or spliceosome complexes), the secondary structure, defined by the set of internal basepair interactions, is more consistently conserved than the primary structure, defined by the sequence of nucleotides. Results The research presented here investigates the possibility of applying a progressive, pairwise approach to the alignment of multiple RNA sequences by simultaneously predicting an energy-optimized consensus secondary structure. We take an existing algorithm for finding the secondary structure common to two RNA sequences, Dynalign, and alter it to align profiles of multiple sequences. We then explore the relative successes of different approaches to designing the tree that will guide progressive alignments of sequence profiles to create a multiple alignment and prediction of conserved structure. Conclusion We have found that applying a progressive, pairwise approach to the alignment of multiple ribonucleic acid sequences produces highly reliable predictions of conserved basepairs, and we have shown how these predictions can be used as constraints to improve the results of a single-sequence structure prediction algorithm. However, we have also discovered that the amount of detail included in a consensus structure prediction is highly dependent on the order in which sequences are added to the alignment (the guide tree), and that if a consensus structure does not have sufficient detail, it is less likely to provide useful constraints for the single-sequence method.
Collapse
|
13
|
Jossinet F, Ludwig TE, Westhof E. RNA structure: bioinformatic analysis. Curr Opin Microbiol 2007; 10:279-85. [PMID: 17548241 DOI: 10.1016/j.mib.2007.05.010] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2007] [Accepted: 05/23/2007] [Indexed: 01/30/2023]
Abstract
The range of functions ascribed to RNA molecules has grown considerably during recent years. Consequently, the analysis and comparison of RNA sequences have become recurrent tasks in molecular biology. Because the biological function of an RNA is expressed more by its folded architecture than by its sequence, original computational tools adapted to the multifaceted RNA functions have to be developed. Such tools, recently published, enable a user to solve classical problems related to RNA research: constructing 'structural' multiple alignments, inferring complete structures and structural motifs from RNA alignments, or searching structural homology in genomic databases.
Collapse
Affiliation(s)
- Fabrice Jossinet
- Architecture et Réactivité de l'ARN, Université Louis Pasteur, Institut de Biologie Moléculaire et Cellulaire, CNRS, F-67084 Strasbourg, France
| | | | | |
Collapse
|
14
|
Putonti C, Pettitt B, Reid J, Fofanov Y. PIDA:A new algorithm for pattern identification. ONLINE JOURNAL OF BIOINFORMATICS : OJB 2007; 8:30-40. [PMID: 19834570 PMCID: PMC2761635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Algorithms for motif identification in sequence space have predominately been focused on recognizing patterns of a fixed length containing regions of perfect conservation with possible regions of unconstrained sequence. Such motifs can be found in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. In the event that an insertion/deletion has occurred within an unconstrained portion of the pattern, it is possible that the pattern retains its functionality. In such a case the length of the pattern is now variable and may be overlooked when utilizing existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here has been developed to recognize patterns that have occurrences of varying length within sequences of any size alphabet. PIDA works by identifying all regions of perfect conservation (for lengths longer than a user-specified threshold), and then builds those conservation "islands" into fixed-length patterns. Next the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the "water" separating the islands. To provide some benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as sequences known to contain conserved patterns. For each of the patterns found, the statistical significance is calculated based upon the pattern's likelihood to appear by chance, thus providing a means to determine those patterns which are likely to have a functional role. The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length although it is also able to identify patterns of a fixed length. PIDA has been created to be as generally applicable as possible since there are a variety of sequence problems of this type. The algorithm was implemented in C++ and is freely available upon request from the authors.
Collapse
Affiliation(s)
- C Putonti
- Department of Computer Science, University of Houston, Houston, Texas, USA
| | | | | | | |
Collapse
|