1
Malik A, Zhang L, Gautam M, Dai N, Li S, Zhang H, Mathews DH, Huang L. LinearAlifold: Linear-Time Consensus Structure Prediction for RNA Alignments. J Mol Biol 2024:168694. [PMID: 38971557 DOI: 10.1016/j.jmb.2024.168694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 06/28/2024] [Accepted: 07/01/2024] [Indexed: 07/08/2024]
Abstract
Predicting the consensus structure of a set of aligned RNA homologs is a convenient method to find conserved structures in an RNA genome, which has many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences, due to a cubic scaling with the sequence length, taking over a day on 400 SARS-CoV-2 and SARS-related genomes (∼30,000nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences, based on our work LinearFold that folds a single RNA in linear time. Our work is orders of magnitude faster than RNAalifold (0.7 hours on the above 400 genomes, or ∼36× speedup) and achieves higher accuracies when compared to a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. Our resource is at: https://github.com/LinearFold/LinearAlifold (code) and http://linearfold.org/linear-alifold (server).
Affiliation(s)
- Apoorv Malik
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - Liang Zhang
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - Milan Gautam
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - Ning Dai
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - Sizhen Li
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - He Zhang
- School of EECS, Oregon State University, Corvallis, OR 97330, USA
| | - David H Mathews
- Dept. of Biochemistry & Biophysics, University of Rochester Medical Center, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, USA; Dept. of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
| | - Liang Huang
- School of EECS, Oregon State University, Corvallis, OR 97330, USA; Dept. of Biochemistry & Biophysics, Oregon State University, Corvallis, OR 97330, USA.
2
Durrant MG, Perry NT, Pai JJ, Jangid AR, Athukoralage JS, Hiraizumi M, McSpedon JP, Pawluk A, Nishimasu H, Konermann S, Hsu PD. Bridge RNAs direct programmable recombination of target and donor DNA. Nature 2024; 630:984-993. [PMID: 38926615 PMCID: PMC11208160 DOI: 10.1038/s41586-024-07552-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 05/09/2024] [Indexed: 06/28/2024]
Abstract
Genomic rearrangements, encompassing mutational changes in the genome such as insertions, deletions or inversions, are essential for genetic diversity. These rearrangements are typically orchestrated by enzymes that are involved in fundamental DNA repair processes, such as homologous recombination, or in the transposition of foreign genetic material by viruses and mobile genetic elements [1,2]. Here we report that IS110 insertion sequences, a family of minimal and autonomous mobile genetic elements, express a structured non-coding RNA that binds specifically to their encoded recombinase. This bridge RNA contains two internal loops encoding nucleotide stretches that base-pair with the target DNA and the donor DNA, which is the IS110 element itself. We demonstrate that the target-binding and donor-binding loops can be independently reprogrammed to direct sequence-specific recombination between two DNA molecules. This modularity enables the insertion of DNA into genomic target sites, as well as programmable DNA excision and inversion. The IS110 bridge recombination system expands the diversity of nucleic-acid-guided systems beyond CRISPR and RNA interference, offering a unified mechanism for the three fundamental DNA rearrangements (insertion, excision and inversion) that are required for genome design.
Affiliation(s)
- Matthew G Durrant
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
| | - Nicholas T Perry
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- University of California, Berkeley-University of California, San Francisco Graduate Program in Bioengineering, Berkeley, CA, USA
| | | | - Aditya R Jangid
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
| | | | - Masahiro Hiraizumi
- Department of Chemistry and Biotechnology, Graduate School of Engineering, University of Tokyo, Tokyo, Japan
| | | | | | - Hiroshi Nishimasu
- Department of Chemistry and Biotechnology, Graduate School of Engineering, University of Tokyo, Tokyo, Japan
- Structural Biology Division, Research Center for Advanced Science and Technology, University of Tokyo, Tokyo, Japan
- Department of Biological Sciences, Graduate School of Science, University of Tokyo, Tokyo, Japan
- Inamori Research Institute for Science, Kyoto, Japan
- Japan Science and Technology Agency, Core Research for Evolutional Science and Technology, Saitama, Japan
| | - Silvana Konermann
- Arc Institute, Palo Alto, CA, USA
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
| | - Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA.
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA.
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
3
Bugnon LA, Di Persia L, Gerard M, Raad J, Prochetto S, Fenoy E, Chorostecki U, Ariel F, Stegmayer G, Milone DH. sincFold: end-to-end learning of short- and long-range interactions in RNA secondary structure. Brief Bioinform 2024; 25:bbae271. [PMID: 38855913 PMCID: PMC11163250 DOI: 10.1093/bib/bbae271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 05/03/2024] [Accepted: 05/24/2024] [Indexed: 06/11/2024] Open
Abstract
MOTIVATION Coding and noncoding RNA molecules participate in many important biological processes. Noncoding RNAs fold into well-defined secondary structures to exert their functions. However, the computational prediction of the secondary structure from a raw RNA sequence is a long-standing unsolved problem, which after decades of almost unchanged performance has now re-emerged due to deep learning. Traditional RNA secondary structure prediction algorithms have been mostly based on thermodynamic models and dynamic programming for free energy minimization. More recently deep learning methods have shown competitive performance compared with the classical ones, but there is still a wide margin for improvement. RESULTS In this work we present sincFold, an end-to-end deep learning approach, that predicts the nucleotides contact matrix using only the RNA sequence as input. The model is based on 1D and 2D residual neural networks that can learn short- and long-range interaction patterns. We show that structures can be accurately predicted with minimal physical assumptions. Extensive experiments were conducted on several benchmark datasets, considering sequence homology and cross-family validation. sincFold was compared with classical methods and recent deep learning models, showing that it can outperform the state-of-the-art methods.
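To make the "contact matrix" output mentioned in the abstract concrete, the following is a minimal sketch of the data structure only. It is not sincFold's code and performs no prediction: it simply converts a known pseudoknot-free structure in dot-bracket notation into a symmetric 0/1 matrix in which entry (i, j) is 1 when nucleotides i and j are paired. The function name `dot_bracket_to_contacts` and the choice of dot-bracket input are assumptions made for illustration.

```python
def dot_bracket_to_contacts(structure: str) -> list[list[int]]:
    """Convert a pseudoknot-free dot-bracket string into a symmetric contact matrix.

    Illustrative helper (hypothetical name, not part of any published tool).
    """
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of unmatched '('

    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unmatched ')'")
            i = open_positions.pop()
            # A base pair is symmetric, so mark both (i, j) and (j, i).
            contacts[i][j] = 1
            contacts[j][i] = 1
        # Any other symbol (typically '.') is treated as unpaired.

    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return contacts


if __name__ == "__main__":
    for row in dot_bracket_to_contacts("((..))"):
        print(row)
```

A model that predicts such a matrix directly from sequence, as the abstract describes, would output one score per (i, j) cell; the matrix above is the kind of binary target those scores would be compared against.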
Affiliation(s)
- Leandro A Bugnon
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Leandro Di Persia
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Matias Gerard
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Jonathan Raad
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Santiago Prochetto
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
- Instituto de Agrobiotecnología del Litoral, CONICET-UNL, CCT-Santa Fe, Ruta Nacional N° 168 Km 0, s/n, Paraje el Pozo, 3000, Santa Fe, Argentina
| | - Emilio Fenoy
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Uciel Chorostecki
- Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya, Barcelona, Spain
| | - Federico Ariel
- Instituto de Agrobiotecnología del Litoral, CONICET-UNL, CCT-Santa Fe, Ruta Nacional N° 168 Km 0, s/n, Paraje el Pozo, 3000, Santa Fe, Argentina
| | - Georgina Stegmayer
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
| | - Diego H Milone
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
4
Mittal A, Turner DH, Mathews DH. NNDB: An Expanded Database of Nearest Neighbor Parameters for Predicting Stability of Nucleic Acid Secondary Structures. J Mol Biol 2024:168549. [PMID: 38522645 DOI: 10.1016/j.jmb.2024.168549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 03/18/2024] [Accepted: 03/19/2024] [Indexed: 03/26/2024]
Abstract
Nearest neighbor thermodynamic parameters are widely used for RNA and DNA secondary structure prediction and to model thermodynamic ensembles of secondary structures. The Nearest Neighbor Database (NNDB) is a freely available web resource (https://rna.urmc.rochester.edu/NNDB) that provides the functional forms, parameter values, and example calculations. The NNDB provides the 1999 and 2004 set of RNA folding nearest neighbor parameters. We expanded the database to include a set of DNA parameters and a set of RNA parameters that includes m6A in addition to the canonical RNA nucleobases. The site was redesigned using the Quarto open-source publishing system. A downloadable PDF version of the complete resource and downloadable sets of nearest neighbor parameters are available.
Affiliation(s)
- Abhinav Mittal
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
| | - Douglas H Turner
- Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, USA; Department of Chemistry, University of Rochester, Rochester, NY 14627, USA
| | - David H Mathews
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, USA.
5
Gray M, Will S, Jabbari H. SparseRNAfolD: optimized sparse RNA pseudoknot-free folding with dangle consideration. Algorithms Mol Biol 2024; 19:9. [PMID: 38433200 DOI: 10.1186/s13015-024-00256-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/13/2024] [Indexed: 03/05/2024] Open
Abstract
MOTIVATION Computational RNA secondary structure prediction by free energy minimization is indispensable for analyzing structural RNAs and their interactions. These methods find the structure with the minimum free energy (MFE) among exponentially many possible structures and have a restrictive time and space complexity (O(n^3) time and O(n^2) space for pseudoknot-free structures) for longer RNA sequences. Furthermore, accurate free energy calculations, including dangle contributions can be difficult and costly to implement, particularly when optimizing for time and space requirements. RESULTS Here we introduce a fast and efficient sparsified MFE pseudoknot-free structure prediction algorithm, SparseRNAFolD, that utilizes an accurate energy model that accounts for dangle contributions. While the sparsification technique was previously employed to improve the time and space complexity of a pseudoknot-free structure prediction method with a realistic energy model, SparseMFEFold, it was not extended to include dangle contributions due to the complexity of computation. This may come at the cost of prediction accuracy. In this work, we compare three different sparsified implementations for dangle contributions and provide pros and cons of each method. As well, we compare our algorithm to LinearFold, a linear time and space algorithm, where we find that in practice, SparseRNAFolD has lower memory consumption across all lengths of sequence and a faster time for lengths up to 1000 bases. CONCLUSION Our SparseRNAFolD algorithm is an MFE-based algorithm that guarantees optimality of result and employs the most general energy model, including dangle contributions. We provide a basis for applying dangles to sparsified recursion in a pseudoknot-free model that has the potential to be extended to pseudoknots.
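The cubic-time, quadratic-space cost quoted in the abstract can be seen in a toy dynamic program. The sketch below is a Nussinov-style base-pair maximization, used here as a deliberately simplified stand-in for free energy minimization. It is not SparseRNAFolD: it has no energy model, no dangle contributions, and no sparsification. The names `max_base_pairs` and `min_loop` are illustrative assumptions.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `seq` (Nussinov-style recursion).

    Toy illustration only: counts pairs instead of summing free energies.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0

    # O(n^2) space: best[i][j] holds the optimum for the subsequence i..j.
    best = [[0] * n for _ in range(n)]

    # O(n^3) time: O(n^2) cells, each scanning O(n) candidate partners k for j.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The inner loop over `k` is where the cubic factor comes from; sparsification techniques of the kind the abstract describes aim to restrict the candidates that loop must examine while keeping the result optimal.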
Affiliation(s)
- Mateo Gray
- Department of Biomedical Engineering, University of Alberta, Street, Edmonton, T6G2R3, AB, Canada.
| | - Sebastian Will
- Department of Computer Science CNRS/LIX (UMR 7161), Institut Polytechnique de Paris, Street, Paris, 10587, France
| | - Hosna Jabbari
- Department of Biomedical Engineering, University of Alberta, Street, Edmonton, T6G2R3, AB, Canada.
6
Zhukova M, Schedl P, Shidlovskii YV. The role of secondary structures in the functioning of 3' untranslated regions of mRNA: A review of functions of 3' UTRs' secondary structures and hypothetical involvement of secondary structures in cytoplasmic polyadenylation in Drosophila. Bioessays 2024; 46:e2300099. [PMID: 38161240 DOI: 10.1002/bies.202300099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 12/11/2023] [Accepted: 12/12/2023] [Indexed: 01/03/2024]
Abstract
3' untranslated regions (3' UTRs) of mRNAs have many functions, including mRNA processing and transport, translational regulation, and mRNA degradation and stability. These different functions require cis-elements in 3' UTRs that can be either sequence motifs or RNA structures. Here we review the role of secondary structures in the functioning of 3' UTRs and discuss some of the trans-acting factors that interact with these secondary structures in eukaryotic organisms. We propose potential participation of 3'-UTR secondary structures in cytoplasmic polyadenylation in the model organism Drosophila melanogaster. Because the secondary structures of 3' UTRs are essential for post-transcriptional regulation of gene expression, their disruption leads to a wide range of disorders, including cancer and cardiovascular diseases. Trans-acting factors, such as STAU1 and nucleolin, which interact with 3'-UTR secondary structures of target transcripts, influence the pathogenesis of neurodegenerative diseases and tumor metastasis, suggesting that they are possible therapeutic targets.
Affiliation(s)
- Mariya Zhukova
- Laboratory of Gene Expression Regulation in Development, Russian Academy of Sciences, Institute of Gene Biology, Moscow, Russia
| | - Paul Schedl
- Laboratory of Gene Expression Regulation in Development, Russian Academy of Sciences, Institute of Gene Biology, Moscow, Russia
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - Yulii V Shidlovskii
- Laboratory of Gene Expression Regulation in Development, Russian Academy of Sciences, Institute of Gene Biology, Moscow, Russia
- Department of Biology and General Genetics, Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
7
McNair K, Salamon P, Edwards RA, Segall AM. PRFect: a tool to predict programmed ribosomal frameshifts in prokaryotic and viral genomes. BMC Bioinformatics 2024; 25:82. [PMID: 38389044 PMCID: PMC10885494 DOI: 10.1186/s12859-024-05701-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Accepted: 02/13/2024] [Indexed: 02/24/2024] Open
Abstract
BACKGROUND One of the stranger phenomena that can occur during gene translation is where, as a ribosome reads along the mRNA, various cellular and molecular properties contribute to stalling the ribosome on a slippery sequence and shifting the ribosome into one of the other two alternate reading frames. The alternate frame has different codons, so different amino acids are added to the peptide chain. More importantly, the original stop codon is no longer in-frame, so the ribosome can bypass the stop codon and continue to translate the codons past it. This produces a longer version of the protein, a fusion of the original in-frame amino acids, followed by all the alternate frame amino acids. There is currently no automated software to predict the occurrence of these programmed ribosomal frameshifts (PRF), and they are currently only identified by manual curation. RESULTS Here we present PRFect, an innovative machine-learning method for the detection and prediction of PRFs in coding genes of various types. PRFect combines advanced machine learning techniques with the integration of multiple complex cellular properties, such as secondary structure, codon usage, ribosomal binding site interference, direction, and slippery site motif. Calculating and incorporating these diverse properties posed significant challenges, but through extensive research and development, we have achieved a user-friendly approach. The code for PRFect is freely available, open-source, and can be easily installed via a single command in the terminal. Our comprehensive evaluations on diverse organisms, including bacteria, archaea, and phages, demonstrate PRFect's strong performance, achieving high sensitivity, specificity, and an accuracy exceeding 90%. CONCLUSION PRFect represents a significant advancement in the field of PRF detection and prediction, offering a powerful tool for researchers and scientists to unravel the intricacies of programmed ribosomal frameshifting in coding genes.
Affiliation(s)
- Katelyn McNair
- Computational Science Research Center, San Diego State University, San Diego, CA, USA.
- Department of Computational Science, University of California Irvine, Irvine, CA, USA.
| | - Peter Salamon
- Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Department of Mathematics and Statistics, San Diego State University, San Diego, CA, USA
| | - Robert A Edwards
- College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
| | - Anca M Segall
- Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Department of Biology and Viral Information Institute, San Diego State University, San Diego, CA, USA
8
Loyer G, Reinharz V. Concurrent prediction of RNA secondary structures with pseudoknots and local 3D motifs in an integer programming framework. Bioinformatics 2024; 40:btae022. [PMID: 38230755 PMCID: PMC10868335 DOI: 10.1093/bioinformatics/btae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 11/30/2023] [Accepted: 01/12/2024] [Indexed: 01/18/2024] Open
Abstract
MOTIVATION The prediction of RNA structure canonical base pairs from a single sequence, especially pseudoknotted ones, remains challenging in a thermodynamic model that approximates the energy of the local 3D motifs joining canonical stems. It has become more and more apparent in recent years that the structural motifs in the loops, composed of noncanonical interactions, are essential for the final shape of the molecule enabling its multiple functions. Our capacity to predict accurate 3D structures is also limited when it comes to the organization of the large intricate network of interactions that form inside those loops. RESULTS We previously developed the integer programming framework RNA Motifs over Integer Programming (RNAMoIP) to reconcile RNA secondary structure and local 3D motif information available in databases. We further develop our model to now simultaneously predict the canonical base pairs (with pseudoknots) from base pair probability matrices with or without alignment. We benchmarked our new method over all nonredundant RNAs below 150 nucleotides. We show that the joined prediction of canonical base pairs structure and local conserved motifs (i) improves the ratio of well-predicted interactions in the secondary structure, (ii) predicts well canonical and Wobble pairs at the location where motifs are inserted, (iii) is greatly improved with evolutionary information, and (iv) noncanonical motifs at kink-turn locations. AVAILABILITY AND IMPLEMENTATION The source code of the framework is available at https://gitlab.info.uqam.ca/cbe/RNAMoIP and an interactive web server at https://rnamoip.cbe.uqam.ca/.
Affiliation(s)
- Gabriel Loyer
- Department of Computer Science, Université du Québec à Montréal, Montréal, QC H2X 3Y7, Canada
| | - Vladimir Reinharz
- Department of Computer Science, Université du Québec à Montréal, Montréal, QC H2X 3Y7, Canada
9
Gaucherand L, Gaglia MM. [The influenza A virus ribonuclease PA-X can differentiate between cellular and viral RNAs through its cut site preference]. Med Sci (Paris) 2024; 40:127-129. [PMID: 38411415 DOI: 10.1051/medsci/2023204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024] Open
Affiliation(s)
- Léa Gaucherand
- Université de Strasbourg, Architecture et réactivité de l'ARN, Institut de biologie moléculaire et cellulaire du CNRS, Strasbourg, France
| | - Marta M Gaglia
- Institute for Molecular Virology and Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, USA
10
Durrant MG, Perry NT, Pai JJ, Jangid AR, Athukoralage JS, Hiraizumi M, McSpedon JP, Pawluk A, Nishimasu H, Konermann S, Hsu PD. Bridge RNAs direct modular and programmable recombination of target and donor DNA. bioRxiv 2024:2024.01.24.577089. [PMID: 38328150 PMCID: PMC10849738 DOI: 10.1101/2024.01.24.577089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
Genomic rearrangements, encompassing mutational changes in the genome such as insertions, deletions, or inversions, are essential for genetic diversity. These rearrangements are typically orchestrated by enzymes involved in fundamental DNA repair processes such as homologous recombination or in the transposition of foreign genetic material by viruses and mobile genetic elements (MGEs). We report that IS110 insertion sequences, a family of minimal and autonomous MGEs, express a structured non-coding RNA that binds specifically to their encoded recombinase. This bridge RNA contains two internal loops encoding nucleotide stretches that base-pair with the target DNA and donor DNA, which is the IS110 element itself. We demonstrate that the target-binding and donor-binding loops can be independently reprogrammed to direct sequence-specific recombination between two DNA molecules. This modularity enables DNA insertion into genomic target sites as well as programmable DNA excision and inversion. The IS110 bridge system expands the diversity of nucleic acid-guided systems beyond CRISPR and RNA interference, offering a unified mechanism for the three fundamental DNA rearrangements required for genome design.
Affiliation(s)
- Matthew G. Durrant
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
| | - Nicholas T. Perry
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- University of California, Berkeley - University of California, San Francisco Graduate Program in Bioengineering, Berkeley, CA, USA
| | - James J. Pai
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
| | - Aditya R. Jangid
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
| | | | - Masahiro Hiraizumi
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
| | | | - April Pawluk
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
| | - Hiroshi Nishimasu
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
- Inamori Research Institute for Science, 620 Suiginya-cho, Shimogyo-ku, Kyoto 600-8411, Japan
- Japan Science and Technology Agency, Core Research for Evolutional Science and Technology, 4-1-8, Honcho, Kawaguchi-shi, Saitama 332-0012, Japan
| | - Silvana Konermann
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
| | - Patrick D. Hsu
- Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
11
Wei J, Lotfy P, Faizi K, Baungaard S, Gibson E, Wang E, Slabodkin H, Kinnaman E, Chandrasekaran S, Kitano H, Durrant MG, Duffy CV, Pawluk A, Hsu PD, Konermann S. Deep learning and CRISPR-Cas13d ortholog discovery for optimized RNA targeting. Cell Syst 2023; 14:1087-1102.e13. [PMID: 38091991 DOI: 10.1016/j.cels.2023.11.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/05/2022] [Revised: 05/03/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023]
Abstract
Effective and precise mammalian transcriptome engineering technologies are needed to accelerate biological discovery and RNA therapeutics. Despite the promise of programmable CRISPR-Cas13 ribonucleases, their utility has been hampered by an incomplete understanding of guide RNA design rules and cellular toxicity resulting from off-target or collateral RNA cleavage. Here, we quantified the performance of over 127,000 RfxCas13d (CasRx) guide RNAs and systematically evaluated seven machine learning models to build a guide efficiency prediction algorithm orthogonally validated across multiple human cell types. Deep learning model interpretation revealed preferred sequence motifs and secondary features for highly efficient guides. We next identified and screened 46 novel Cas13d orthologs, finding that DjCas13d achieves low cellular toxicity and high specificity, even when targeting abundant transcripts in sensitive cell types, including stem cells and neurons. Our Cas13d guide efficiency model was successfully generalized to DjCas13d, illustrating the power of combining machine learning with ortholog discovery to advance RNA targeting in human cells.
Affiliation(s)
- Jingyi Wei
- Department of Bioengineering, Stanford University, Stanford, CA, USA; Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
| | - Peter Lotfy
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Kian Faizi
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
| | | | | | - Eleanor Wang
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
| | - Hannah Slabodkin
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
| | - Emily Kinnaman
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
| | - Sita Chandrasekaran
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
| | - Hugo Kitano
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Matthew G Durrant
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
| | - Connor V Duffy
- Arc Institute, Palo Alto, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA
| | | | - Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA.
| | - Silvana Konermann
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA.
12
Rocca R, Grillone K, Citriniti EL, Gualtieri G, Artese A, Tagliaferri P, Tassone P, Alcaro S. Targeting non-coding RNAs: Perspectives and challenges of in-silico approaches. Eur J Med Chem 2023; 261:115850. [PMID: 37839343 DOI: 10.1016/j.ejmech.2023.115850] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/25/2023] [Revised: 09/08/2023] [Accepted: 09/29/2023] [Indexed: 10/17/2023]
Abstract
The growing information currently available on the central role of non-coding RNAs (ncRNAs), including microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), in chronic and degenerative human diseases makes them attractive therapeutic targets. RNAs carry out different functional roles in human biology and are deeply deregulated in several diseases. So far, different attempts to therapeutically target 3D RNA structures with small molecules have been reported. In this scenario, the development of computational tools suitable for describing RNA structures and their potential interactions with small molecules is attracting growing interest. Here, we describe the most suitable strategies to study ncRNAs through computational tools. We focus on methods capable of predicting 2D and 3D ncRNA structures. Furthermore, we describe computational tools to identify, design and optimize small-molecule ncRNA binders. This review aims to outline the state of the art and perspectives of computational methods for ncRNAs over the past decade.
Affiliation(s)
- Roberta Rocca
- Department of Health Science, Magna Graecia University, Catanzaro, Italy; Net4Science srl, Academic Spinoff, Magna Græcia University, Catanzaro, Italy
| | - Katia Grillone
- Department of Experimental and Clinical Medicine, Magna Græcia University, Catanzaro, Italy
| | | | | | - Anna Artese
- Department of Health Science, Magna Graecia University, Catanzaro, Italy; Net4Science srl, Academic Spinoff, Magna Græcia University, Catanzaro, Italy.
| | | | - Pierfrancesco Tassone
- Department of Experimental and Clinical Medicine, Magna Græcia University, Catanzaro, Italy
| | - Stefano Alcaro
- Department of Health Science, Magna Graecia University, Catanzaro, Italy; Net4Science srl, Academic Spinoff, Magna Græcia University, Catanzaro, Italy
| |
|
13
|
Binet T, Padiolleau-Lefèvre S, Octave S, Avalle B, Maffucci I. Comparative Study of Single-stranded Oligonucleotides Secondary Structure Prediction Tools. BMC Bioinformatics 2023; 24:422. [PMID: 37940855 PMCID: PMC10634105 DOI: 10.1186/s12859-023-05532-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2023] [Accepted: 10/13/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND Single-stranded nucleic acids (ssNAs) have important biological roles and a high biotechnological potential linked to their ability to bind to numerous molecular targets. This depends on the different spatial conformations they can assume. The first level of ssNA spatial organisation corresponds to their base-pairing pattern, i.e. their secondary structure. Many computational tools have been developed to predict ssNA secondary structures, making the choice of the appropriate tool difficult, and an up-to-date guide on the limits and applicability of current secondary structure prediction tools is missing. Therefore, we performed a comparative study of the performance of 9 freely available tools (mfold, RNAfold, CentroidFold, CONTRAfold, MC-Fold, LinearFold, UFold, SPOT-RNA, and MXfold2) on a dataset of 538 ssNAs with known experimental secondary structure. RESULTS The minimum free energy-based tools, namely mfold and RNAfold, and some tools based on artificial intelligence, namely CONTRAfold and MXfold2, provided the best results, with [Formula: see text] of exact predictions, whilst MC-Fold seemed to be the worst performing tool, with only [Formula: see text] of exact predictions. In addition, UFold and SPOT-RNA are the only options for pseudoknot prediction. Including 5-10 suboptimal solutions in the analysis of mfold and RNAfold results further improved the performance of these tools. Nevertheless, we could observe issues in predicting particular motifs, such as multi-way junctions and mini-dumbbells, or ssNAs whose structure has been determined in complex with a protein. In addition, our benchmark shows that further effort is needed for ssDNA secondary structure prediction. CONCLUSIONS In general, mfold, RNAfold, and MXfold2 currently seem to be the best choice for ssNA secondary structure prediction, although they still show some limits linked to specific structural motifs. Nevertheless, current trends suggest that artificial intelligence has a high potential to overcome these remaining issues; for example, the recently developed UFold and SPOT-RNA have a high success rate in predicting pseudoknots.
Affiliation(s)
- Thomas Binet
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu - CS 60 319, 60203, Compiègne Cedex, France
| | - Séverine Padiolleau-Lefèvre
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu - CS 60 319, 60203, Compiègne Cedex, France
| | - Stéphane Octave
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu - CS 60 319, 60203, Compiègne Cedex, France
| | - Bérangère Avalle
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu - CS 60 319, 60203, Compiègne Cedex, France.
| | - Irene Maffucci
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu - CS 60 319, 60203, Compiègne Cedex, France.
| |
|
14
|
Wang Y, Zhang H, Xu Z, Zhang S, Guo R. TransUFold: Unlocking the structural complexity of short and long RNA with pseudoknots. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:19320-19340. [PMID: 38052602 DOI: 10.3934/mbe.2023854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/07/2023]
Abstract
The RNA secondary structure is like a blueprint that holds the key to unlocking the mysteries of RNA function and 3D structure. It serves as a crucial foundation for investigating the complex world of RNA, making it an indispensable component of research in this exciting field. However, pseudoknots cannot be accurately predicted by conventional prediction methods based on free energy minimization, which results in a performance bottleneck. To this end, we propose a deep learning-based method called TransUFold to train directly on RNA data annotated with structure information. It employs an encoder-decoder network architecture, named Vision Transformer, to extract long-range interactions in RNA sequences and utilizes convolutions with lateral connections to supplement short-range interactions. Then, a post-processing program is designed to constrain the model's output to produce realistic and effective RNA secondary structures, including pseudoknots. After training TransUFold on benchmark datasets, we outperform other methods in test data on the same family. Additionally, we achieve better results on longer sequences up to 1600 nt, demonstrating the outstanding performance of Vision Transformer in extracting long-range interactions in RNA sequences. Finally, our analysis indicates that TransUFold produces effective pseudoknot structures in long sequences. As more high-quality RNA structures become available, deep learning-based prediction methods like Vision Transformer can exhibit better performance.
Affiliation(s)
- Yunxiang Wang
- School of Cyber Security and Computer, Hebei University, Baoding, Hebei, China
| | - Hong Zhang
- School of Cyber Security and Computer, Hebei University, Baoding, Hebei, China
| | - Zhenchao Xu
- School of Cyber Security and Computer, Hebei University, Baoding, Hebei, China
| | - Shouhua Zhang
- Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland
| | - Rui Guo
- College of Life Sciences, Institute of Life Science and Green Development, Hebei University, Baoding, China
| |
|
15
|
Zhang H, Li S, Dai N, Zhang L, Mathews DH, Huang L. LinearCoFold and LinearCoPartition: linear-time algorithms for secondary structure prediction of interacting RNA molecules. Nucleic Acids Res 2023; 51:e94. [PMID: 37650626 PMCID: PMC10570024 DOI: 10.1093/nar/gkad664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2022] [Revised: 06/15/2023] [Accepted: 08/17/2023] [Indexed: 09/01/2023] Open
Abstract
Many RNAs function through RNA-RNA interactions. Fast and reliable RNA structure prediction that accounts for RNA-RNA interaction is useful; however, existing tools are either too simplistic or too slow. To address this issue, we present LinearCoFold, which approximates the complete minimum free energy structure of two strands in linear time, and LinearCoPartition, which approximates the cofolding partition function and base pairing probabilities in linear time. LinearCoFold and LinearCoPartition are orders of magnitude faster than RNAcofold. For example, on a sequence pair with a combined length of 26,190 nt, LinearCoFold is 86.8× faster than RNAcofold in MFE mode, and LinearCoPartition is 642.3× faster than RNAcofold in partition function mode. Surprisingly, the predictions of LinearCoFold and LinearCoPartition have higher PPV and sensitivity for intermolecular base pairs. Furthermore, we apply LinearCoFold to predict the RNA-RNA interaction between SARS-CoV-2 genomic RNA (gRNA) and human U4 small nuclear RNA (snRNA), which has been experimentally studied, and observe that LinearCoFold's prediction correlates better with the wet lab results than RNAcofold's.
Affiliation(s)
- He Zhang
- Baidu Research, Sunnyvale, CA, USA
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Sizhen Li
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Ning Dai
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Liang Zhang
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - David H Mathews
- Department of Biochemistry & Biophysics, Rochester, NY 14642, USA
- Center for RNA Biology, Rochester, NY 14642, USA
- Department of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
| | - Liang Huang
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| |
|
16
|
Hara K, Iwano N, Fukunaga T, Hamada M. DeepRaccess: high-speed RNA accessibility prediction using deep learning. FRONTIERS IN BIOINFORMATICS 2023; 3:1275787. [PMID: 37881622 PMCID: PMC10597636 DOI: 10.3389/fbinf.2023.1275787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2023] [Accepted: 09/29/2023] [Indexed: 10/27/2023] Open
Abstract
RNA accessibility is a useful RNA secondary structural feature for predicting RNA-RNA interactions and translation efficiency in prokaryotes. However, conventional accessibility calculation tools, such as Raccess, are computationally expensive and require considerable time to perform transcriptome-scale analysis. In this study, we developed DeepRaccess, which predicts RNA accessibility based on deep learning methods. DeepRaccess was trained to take artificial RNA sequences as input and to predict the accessibility of these sequences as calculated by Raccess. Simulation and empirical dataset analyses showed that the accessibility predicted by DeepRaccess was highly correlated with the accessibility calculated by Raccess. In addition, we confirmed that DeepRaccess could predict protein abundance in E. coli with moderate accuracy from the sequences around the start codon. We also demonstrated that DeepRaccess achieved a speed-up of tens to hundreds of times in a GPU environment. The source code and the trained models of DeepRaccess are freely available at https://github.com/hmdlab/DeepRaccess.
Affiliation(s)
- Kaisei Hara
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, Japan
| | - Natsuki Iwano
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo, Japan
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
| |
|
17
|
Zhang H, Zhang L, Lin A, Xu C, Li Z, Liu K, Liu B, Ma X, Zhao F, Jiang H, Chen C, Shen H, Li H, Mathews DH, Zhang Y, Huang L. Algorithm for optimized mRNA design improves stability and immunogenicity. Nature 2023; 621:396-403. [PMID: 37130545 PMCID: PMC10499610 DOI: 10.1038/s41586-023-06127-z] [Citation(s) in RCA: 52] [Impact Index Per Article: 52.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2022] [Accepted: 04/25/2023] [Indexed: 05/04/2023]
Abstract
Messenger RNA (mRNA) vaccines are being used to combat the spread of COVID-19 (refs. 1-3), but they still exhibit critical limitations caused by mRNA instability and degradation, which are major obstacles for the storage, distribution and efficacy of the vaccine products (ref. 4). Increasing secondary structure lengthens mRNA half-life, which, together with optimal codons, improves protein expression (ref. 5). Therefore, a principled mRNA design algorithm must optimize both structural stability and codon usage. However, owing to synonymous codons, the mRNA design space is prohibitively large; for example, there are around 2.4 × 10^632 candidate mRNA sequences for the SARS-CoV-2 spike protein. This poses insurmountable computational challenges. Here we provide a simple and unexpected solution using the classical concept of lattice parsing in computational linguistics, where finding the optimal mRNA sequence is analogous to identifying the most likely sentence among similar-sounding alternatives (ref. 6). Our algorithm LinearDesign finds an optimal mRNA design for the spike protein in just 11 minutes, and can concurrently optimize stability and codon usage. LinearDesign substantially improves mRNA half-life and protein expression, and profoundly increases antibody titre by up to 128 times in mice compared to the codon-optimization benchmark on mRNA vaccines for COVID-19 and varicella-zoster virus. This result reveals the great potential of principled mRNA design and enables the exploration of previously unreachable but highly stable and efficient designs. Our work is a timely tool for vaccines and other mRNA-based medicines encoding therapeutic proteins such as monoclonal antibodies and anti-cancer drugs (refs. 7,8).
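As a rough illustration of why the design space described in this abstract grows so quickly: the number of candidate mRNA sequences for a protein is the product, over its residues, of the number of synonymous codons available for each amino acid. The sketch below assumes the standard genetic code; the function name `design_space_size` and the short example peptide are hypothetical and are not taken from LinearDesign.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# ("*" denotes the stop codon). The counts sum to 64.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def design_space_size(protein: str) -> int:
    """Count the mRNA sequences that encode `protein` (one-letter codes)."""
    return prod(CODON_COUNTS[aa] for aa in protein.upper())


# Hypothetical five-residue example: M(1) * K(2) * L(6) * V(4) * stop(3)
print(design_space_size("MKLV*"))  # 144
```

Because the count multiplies at every residue, it grows exponentially with protein length, which is why exhaustive enumeration is infeasible for a protein the size of the spike.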
Affiliation(s)
- He Zhang
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
| | - Liang Zhang
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
- Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Ang Lin
- StemiRNA Therapeutics, Shanghai, China
- Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | | | - Ziyu Li
- Baidu Research USA, Sunnyvale, CA, USA
| | - Kaibo Liu
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
| | - Boxiang Liu
- Baidu Research USA, Sunnyvale, CA, USA
- Department of Pharmacy, National University of Singapore, Singapore, Singapore
| | | | | | | | | | | | | | - David H Mathews
- Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA.
- Center for RNA Biology, University of Rochester Medical Center, Rochester, NY, USA.
- Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA.
- Coderna.ai, Inc., Sunnyvale, CA, USA.
| | - Yujian Zhang
- StemiRNA Therapeutics, Shanghai, China.
- , Gaithersburg, MD, USA.
| | - Liang Huang
- Baidu Research USA, Sunnyvale, CA, USA.
- School of EECS, Oregon State University, Corvallis, OR, USA.
- Coderna.ai, Inc., Sunnyvale, CA, USA.
| |
|
18
|
Kulkarni M, Thangappan J, Deb I, Wu S. Comparative analysis of RNA secondary structure accuracy on predicted RNA 3D models. PLoS One 2023; 18:e0290907. [PMID: 37656749 PMCID: PMC10473517 DOI: 10.1371/journal.pone.0290907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2023] [Accepted: 08/18/2023] [Indexed: 09/03/2023] Open
Abstract
RNA structure is conformationally dynamic, and accurate all-atom tertiary (3D) structure modeling of RNA remains challenging with the prevailing tools. Secondary structure (2D) information is the standard prerequisite for most RNA 3D modeling. Despite several 2D and 3D structure prediction tools proposed in recent years, one of the challenges is to choose the best combination for accurate RNA 3D structure prediction. Here, we benchmarked seven small RNA PDB structures (40 to 90 nucleotides) with different topologies to understand the effects of different 2D structure predictions on the accuracy of 3D modeling. The current study explores the blind challenge of 2D to 3D conversions and highlights the performances of de novo RNA 3D modeling from their predicted 2D structure constraints. Our results show that conformational sampling-based methods such as SimRNA and IsRNA1 depend less on 2D accuracy, whereas motif-based methods account for 2D evidence. Our observations illustrate the disparities in available 3D and 2D prediction methods and may further offer insights into developing topology-specific or family-specific RNA structure prediction pipelines.
Affiliation(s)
- Mandar Kulkarni
- R&D Center, PharmCADD Co. Ltd., Dong-gu, Busan, Republic of Korea
| | | | - Indrajit Deb
- R&D Center, PharmCADD Co. Ltd., Dong-gu, Busan, Republic of Korea
| | - Sangwook Wu
- R&D Center, PharmCADD Co. Ltd., Dong-gu, Busan, Republic of Korea
- Department of Physics, Pukyong National University, Busan, Republic of Korea
| |
|
19
|
Yang E, Zhang H, Zang Z, Zhou Z, Wang S, Liu Z, Liu Y. GCNfold: A novel lightweight model with valid extractors for RNA secondary structure prediction. Comput Biol Med 2023; 164:107246. [PMID: 37487383 DOI: 10.1016/j.compbiomed.2023.107246] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2023] [Revised: 06/23/2023] [Accepted: 07/07/2023] [Indexed: 07/26/2023]
Abstract
RNA secondary structure is essential for predicting the tertiary structure and understanding RNA function. Recent research tends to stack numerous modules to design large deep-learning models. This can raise accuracy above 70%, but it also significantly increases training costs and reduces prediction efficiency. We propose a model with three feature extractors called GCNfold. The Structure Extractor utilizes a three-layer Graph Convolutional Network (GCN) to mine the structural information of RNA, such as stems, hairpins, and internal loops. Structure and Sequence Fusion embeds structural information into sequences with Transformer Encoders. The Long-distance Dependency Extractor captures long-range pairwise relationships with a UNet. The experiments indicate that, among all models with over 80% accuracy, GCNfold has a small number of parameters, a fast inference speed, and a high accuracy. Additionally, GCNfold-Small takes only 90 ms to infer an RNA secondary structure and can achieve close to 90% accuracy on average. The GCNfold code is available on GitHub at https://github.com/EnbinYang/GCNfold.
Affiliation(s)
- Enbin Yang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Hao Zhang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; College of Software, Jilin University, Changchun, 130012, China
| | - Zinan Zang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Zhiyong Zhou
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Shuo Wang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Zhen Liu
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; Graduate School of Engineering, Nagasaki Institute of Applied Science, 536 Aba-machi, Nagasaki 851-0193, Japan
| | - Yuanning Liu
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; College of Software, Jilin University, Changchun, 130012, China.
| |
|
20
|
Tang M, Hwang K, Kang SH. StemP: A Fast and Deterministic Stem-Graph Approach for RNA Secondary Structure Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3278-3291. [PMID: 37028040 DOI: 10.1109/tcbb.2023.3253049] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
We propose a new deterministic methodology to predict the secondary structure of RNA sequences. What information about stems is important for structure prediction, and is it enough? The proposed simple deterministic algorithm uses minimum stem length, Stem-Loop score, and co-existence of stems to give good structure predictions for short RNA and tRNA sequences. The main idea is to consider all possible stems with a certain stem-loop energy and strength to predict RNA secondary structure. We use graph notation, where stems are represented as vertices and co-existence between stems as edges. This full Stem-graph represents all possible folding structures, and we pick the sub-graph(s) that give the best matching energy for structure prediction. The Stem-Loop score adds structure information and speeds up the computation. The proposed method can predict secondary structure even with pseudoknots. One of the strengths of this approach is the simplicity and flexibility of the algorithm, and it gives a deterministic answer. Numerical experiments are done on various sequences from the Protein Data Bank and the Gutell Lab using a laptop, and results take only a few seconds.
|
21
|
Hidalgo M, Ramos C, Zolla G. Analysis of lncRNAs in Lupinus mutabilis (Tarwi) and Their Potential Role in Drought Response. Noncoding RNA 2023; 9:48. [PMID: 37736894 PMCID: PMC10514842 DOI: 10.3390/ncrna9050048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Revised: 08/01/2023] [Accepted: 08/16/2023] [Indexed: 09/23/2023] Open
Abstract
Lupinus mutabilis is a legume with high agronomic potential and available transcriptomic data for which lncRNAs have not been studied. Therefore, our objective was to identify, characterize, and validate the drought-responsive lncRNAs in L. mutabilis. To achieve this, we used a multilevel approach based on lncRNA prediction, annotation, subcellular location, thermodynamic characterization, structural conservation, and validation. Thus, 590 lncRNAs were identified by at least two algorithms of lncRNA identification. Annotation with the PLncDB database showed 571 lncRNAs unique to tarwi and 19 lncRNAs with homology in 28 botanical families including Solanaceae (19), Fabaceae (17), Brassicaceae (17), Rutaceae (17), Rosaceae (16), and Malvaceae (16), among others. In total, 12 lncRNAs had homology in more than 40 species. A total of 67% of lncRNAs were located in the cytoplasm and 33% in exosomes. Thermodynamic characterization of S03 showed a stable secondary structure with -105.67 kcal/mol. This structure included three regions, with a multibranch loop containing a hairpin with a SECIS-like element. Evaluation of the structural conservation by CROSSalign revealed partial similarities between L. mutabilis (S03) and S. lycopersicum (Solyc04r022210.1). RT-PCR validation demonstrated that S03 was upregulated in a drought-tolerant accession of L. mutabilis. Finally, these results highlighted the importance of lncRNAs in tarwi improvement under drought conditions.
Affiliation(s)
- Manuel Hidalgo
- Programa de Estudio de Medicina Humana, Universidad Privada Antenor Orrego, Av. América Sur 3145, Trujillo 13008, Peru; (M.H.); (C.R.)
| | - Cynthia Ramos
- Programa de Estudio de Medicina Humana, Universidad Privada Antenor Orrego, Av. América Sur 3145, Trujillo 13008, Peru; (M.H.); (C.R.)
| | - Gaston Zolla
- Laboratorio de Fisiología Molecular de Plantas del Programa de Cereales y Granos Nativos, Facultad de Agronomía, Universidad Nacional Agraria La Molina, Lima 12, Peru
| |
|
22
|
Waldl M, Spicher T, Lorenz R, Beckmann IK, Hofacker IL, Löhneysen SV, Stadler PF. Local RNA folding revisited. J Bioinform Comput Biol 2023; 21:2350016. [PMID: 37522173 DOI: 10.1142/s0219720023500166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/01/2023]
Abstract
Most of the functional RNA elements located within large transcripts are local. Local folding therefore serves as a practically useful approximation to global structure prediction. Due to the sensitivity of RNA secondary structure prediction to the exact definition of sequence ends, accuracy can be increased by averaging local structure predictions over multiple, overlapping sequence windows. These averages can be computed efficiently by dynamic programming. Here we revisit the local folding problem, present a concise mathematical formalization that generalizes previous approaches and show that correct Boltzmann samples can be obtained by local stochastic backtracing in McCaskill's algorithms but not from local folding recursions. Corresponding new features are implemented in the ViennaRNA package to improve the support of local folding. Applications include the computation of maximum expected accuracy structures from RNAplfold data and a mutual information measure to quantify the sensitivity of individual sequence positions.
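The window-averaging idea in this abstract can be sketched in a few lines. The example below is a naive illustration only: it assumes that per-window base-pair probabilities have already been computed by some folding tool and simply averages each pair over every fixed-length window that could contain it. The function name `average_window_probs` and the input layout are hypothetical; this is not the dynamic-programming scheme used in the ViennaRNA package.

```python
from collections import defaultdict


def average_window_probs(window_probs, n, w):
    """Average pair probabilities over all length-w windows containing the pair.

    window_probs maps a window start s to {(i, j): p}, where i < j are global
    0-based positions inside the window [s, s + w). A pair missing from a
    window's dictionary is treated as having probability 0 in that window.
    n is the sequence length.
    """
    totals = defaultdict(float)
    for start, probs in window_probs.items():
        for (i, j), p in probs.items():
            if not (start <= i < j < start + w):
                raise ValueError(f"pair {(i, j)} lies outside window {start}")
            totals[(i, j)] += p

    averaged = {}
    for (i, j), total in totals.items():
        first = max(0, j - w + 1)   # leftmost window that still covers j
        last = min(i, n - w)        # rightmost window that still covers i
        averaged[(i, j)] = total / (last - first + 1)
    return averaged


# Toy input: sequence length 6, window length 4, so windows start at 0, 1, 2.
probs = {0: {(1, 3): 0.8}, 1: {(1, 3): 0.4, (2, 4): 0.5}, 2: {}}
avg = average_window_probs(probs, n=6, w=4)
print(round(avg[(1, 3)], 3), round(avg[(2, 4)], 3))  # 0.6 0.25
```

Averaging in this way dampens the dependence of any single prediction on where the window boundaries happen to fall.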
Affiliation(s)
- Maria Waldl
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
| | - Thomas Spicher
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
| | - Ronny Lorenz
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
| | - Irene K Beckmann
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
| | - Ivo L Hofacker
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
| | - Sarah Von Löhneysen
- Institute of Computer Science and Interdisciplinary Center for Bioinformatics, Leipzig University, Härtelstraße 16-18, D-04107 Leipzig, Germany
| | - Peter F Stadler
- Institute of Computer Science and Interdisciplinary Center for Bioinformatics, Leipzig University, Härtelstraße 16-18, D-04107 Leipzig, Germany
| |
|
23
|
Neugroschl A, Catrina IE. TFOFinder: Python program for identifying purine-only double-stranded stretches in the predicted secondary structure(s) of RNA targets. PLoS Comput Biol 2023; 19:e1011418. [PMID: 37624852 PMCID: PMC10484449 DOI: 10.1371/journal.pcbi.1011418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 09/07/2023] [Accepted: 08/08/2023] [Indexed: 08/27/2023] Open
Abstract
Nucleic acid probes are valuable tools in biology and chemistry and are indispensable for PCR amplification of DNA, RNA quantification and visualization, and downregulation of gene expression. Recently, triplex-forming oligonucleotides (TFO) have received increased attention due to their improved selectivity and sensitivity in recognizing purine-rich double-stranded RNA regions at physiological pH by incorporating backbone and base modifications. For example, triplex-forming peptide nucleic acid (PNA) oligomers have been used for imaging a structured RNA in cells and inhibiting influenza A replication. Although a handful of programs are available to identify triplex target sites (TTS) in DNA, none are available that find such regions in structured RNAs. Here, we describe TFOFinder, a Python program that facilitates the identification of intramolecular purine-only RNA duplexes that are amenable to forming parallel triple helices (pyrimidine/purine/pyrimidine) and the design of the corresponding TFO(s). We performed genome- and transcriptome-wide analyses of TTS in Drosophila melanogaster and found that only 0.3% (123) of total unique transcripts (35,642) show the potential of forming 12-purine long triplex forming sites that contain at least one guanine. Using minimization algorithms, we predicted the secondary structure(s) of these transcripts, and using TFOFinder, we found that 97 (79%) of the identified 123 transcripts are predicted to fold to form at least one TTS for parallel triple helix formation. The number of transcripts with potential purine TTS increases when the strict search conditions are relaxed by decreasing the length of the probe or by allowing up to two pyrimidine inversions or 1-nucleotide bulge in the target site. These results are encouraging for the use of modified triplex forming probes for live imaging of endogenous structured RNA targets, such as pre-miRNAs, and inhibition of target-specific translation and viral replication.
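The core search described in this abstract, finding purine-only stretches that are predicted to be double-stranded, can be illustrated with a simplified sketch. It assumes the predicted secondary structure is given in dot-bracket notation and reports runs of consecutive positions that are both purines (A or G) and base-paired. The function name `purine_paired_runs` and its parameters are hypothetical; this is not the TFOFinder implementation, and it does not check that all positions in a run belong to the same helix or that the run contains a guanine.

```python
def purine_paired_runs(seq, structure, min_len=12):
    """Return 0-based half-open (start, end) runs of paired purines.

    seq is an RNA sequence; structure is its dot-bracket string, where
    "(" and ")" mark base-paired positions and "." marks unpaired ones.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    runs = []
    start = None
    for i, (base, sym) in enumerate(zip(seq.upper(), structure)):
        ok = base in "AG" and sym in "()"
        if ok and start is None:
            start = i                      # a new candidate run begins
        elif not ok and start is not None:
            if i - start >= min_len:       # keep runs that are long enough
                runs.append((start, i))
            start = None
    if start is not None and len(seq) - start >= min_len:
        runs.append((start, len(seq)))     # run extends to the 3' end
    return runs


# Toy hairpin: a six-purine 5' arm paired with a pyrimidine 3' arm.
seq = "GGAGAACCCCUUCUCC"
structure = "((((((....))))))"
print(purine_paired_runs(seq, structure, min_len=6))  # [(0, 6)]
```

Lowering `min_len` corresponds to the relaxed search conditions mentioned in the abstract, where shorter probes yield more candidate sites.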
Affiliation(s)
- Atara Neugroschl
- Department of Chemistry and Biochemistry, Stern College for Women, Yeshiva University, New York, New York, United States of America
| | - Irina E. Catrina
- Department of Chemistry and Biochemistry, Yeshiva College, Yeshiva University, New York, New York, United States of America
24
Dasgupta S, LaDu JK, Garcia GR, Li S, Tomono-Duval K, Rericha Y, Huang L, Tanguay RL. A CRISPR-Cas9 mutation in sox9b long intergenic noncoding RNA (slincR) affects zebrafish development, behavior, and regeneration. Toxicol Sci 2023; 194:153-166. [PMID: 37220911 PMCID: PMC10375313 DOI: 10.1093/toxsci/kfad050] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 05/25/2023]
Abstract
The role of long noncoding RNAs (lncRNAs) as regulators of toxicological responses to environmental chemicals is gaining prominence. Previously, our laboratory discovered an lncRNA, sox9b long intergenic noncoding RNA (slincR), that is activated by multiple ligands of the aryl hydrocarbon receptor (AHR). Within this study, we designed a CRISPR-Cas9-mediated slincR zebrafish mutant line to better understand its biological function in the presence or absence of a model AHR ligand, 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). The slincRosu3 line contains an 18 bp insertion within the slincR sequence that changes its predicted mRNA secondary structure. Toxicological profiling showed that slincRosu3 is equally or more sensitive to TCDD for morphological and behavioral phenotypes. Embryonic mRNA-sequencing showed differential responses of 499 or 908 genes in slincRosu3 in the absence or presence of TCDD. Specifically, unexposed slincRosu3 embryos showed disruptions in metabolic pathways, suggesting an endogenous role for slincR. slincRosu3 embryos also had repressed mRNA levels of sox9b, a transcription factor that slincR is known to negatively regulate. Hence, we studied cartilage development and regenerative capacity, both processes partially regulated by sox9b. Cartilage development was disrupted in slincRosu3 embryos both in the presence and absence of TCDD. slincRosu3 embryos also displayed a lack of regenerative capacity of amputated tail fins, accompanied by a lack of cell proliferation. In summary, using a novel slincR mutant line, we show that a mutation in slincR can have widespread impacts on gene expression and structural development endogenously, and limited but significant impacts in the presence of AHR induction, which further highlights its importance in the developmental process.
Affiliation(s)
- Subham Dasgupta
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
| | - Jane K LaDu
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
| | - Gloria R Garcia
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
| | - Sizhen Li
- Department of Electrical Engineering and Computer Science, College of Engineering, Oregon State University, Corvallis, OR 97331, USA
| | - Konoha Tomono-Duval
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
| | - Yvonne Rericha
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
| | - Liang Huang
- Department of Electrical Engineering and Computer Science, College of Engineering, Oregon State University, Corvallis, OR 97331, USA
| | - Robyn L Tanguay
- Sinnhuber Aquatic Research Laboratory, Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR 97333, USA
25
Wu KE, Zou JY, Chang H. Machine learning modeling of RNA structures: methods, challenges and future perspectives. Brief Bioinform 2023; 24:bbad210. [PMID: 37280185 DOI: 10.1093/bib/bbad210] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/13/2023] [Revised: 05/12/2023] [Accepted: 05/17/2023] [Indexed: 06/08/2023]
Abstract
The three-dimensional structure of RNA molecules plays a critical role in a wide range of cellular processes encompassing functions from riboswitches to epigenetic regulation. These RNA structures are incredibly dynamic and can indeed be described aptly as an ensemble of structures that shifts in distribution depending on different cellular conditions. Thus, the computational prediction of RNA structure poses a unique challenge, even as computational protein folding has seen great advances. In this review, we focus on a variety of machine learning-based methods that have been developed to predict RNA molecules' secondary structure, as well as more complex tertiary structures. We survey commonly used modeling strategies, and how many are inspired by or incorporate thermodynamic principles. We discuss the shortcomings that various design decisions entail and propose future directions that could build off these methods to yield more robust, accurate RNA structure predictions.
Affiliation(s)
- Kevin E Wu
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - James Y Zou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Howard Chang
- Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
26
Sato K, Hamada M. Recent trends in RNA informatics: a review of machine learning and deep learning for RNA secondary structure prediction and RNA drug discovery. Brief Bioinform 2023; 24:bbad186. [PMID: 37232359 PMCID: PMC10359090 DOI: 10.1093/bib/bbad186] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 01/16/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/27/2023]
Abstract
Computational analysis of RNA sequences constitutes a crucial step in the field of RNA biology. As in other domains of the life sciences, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction in recent years. Historically, thermodynamics-based methods were widely employed for the prediction of RNA secondary structures; however, machine learning-based approaches have demonstrated remarkable advancements in recent years, enabling more accurate predictions. Consequently, the precision of sequence analysis pertaining to RNA secondary structures, such as RNA-protein interactions, has also been enhanced, making a substantial contribution to the field of RNA biology. Additionally, artificial intelligence and machine learning are also introducing technical innovations in the analysis of RNA-small molecule interactions for RNA-targeted drug discovery and in the design of RNA aptamers, where RNA serves as its own ligand. This review will highlight recent trends in the prediction of RNA secondary structure, RNA aptamers and RNA drug discovery using machine learning, deep learning and related technologies, and will also discuss potential future avenues in the field of RNA informatics.
Affiliation(s)
- Kengo Sato
- School of System Design and Technology, Tokyo Denki University, 5 Senju Asahi-cho, Adachi-ku, Tokyo 120-8551, Japan
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL) , National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8602, Japan
27
Abstract
RNAstructure is a user-friendly program for the prediction and analysis of RNA secondary structure. It is available as a web server, a program with a graphical user interface, or a set of command line tools. The programs are available for Microsoft Windows, macOS, or Linux. This article provides protocols for prediction of RNA secondary structure (using the web server, the graphical user interface, or the command line) and high-affinity oligonucleotide binding sites to a structured RNA target (using the graphical user interface). © 2023 Wiley Periodicals LLC.
- Basic Protocol 1: Predicting RNA secondary structure using the RNAstructure web server
- Alternate Protocol 1: Predicting secondary structure and base pair probabilities using the RNAstructure graphical user interface
- Alternate Protocol 2: Predicting secondary structure and base pair probabilities using the RNAstructure command line interface
- Basic Protocol 2: Predicting binding affinities of oligonucleotides complementary to an RNA target using OligoWalk
Affiliation(s)
- Sara E Ali
- Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
| | - Abhinav Mittal
- Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
| | - David H Mathews
- Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
28
Gaucherand L, Iyer A, Gilabert I, Rycroft CH, Gaglia MM. Cut site preference allows influenza A virus PA-X to discriminate between host and viral mRNAs. Nat Microbiol 2023; 8:1304-1317. [PMID: 37349586 PMCID: PMC10690756 DOI: 10.1038/s41564-023-01409-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/28/2022] [Accepted: 05/10/2023] [Indexed: 06/24/2023]
Abstract
Many viruses block host gene expression to take over the infected cell. This process, termed host shutoff, is thought to promote viral replication by preventing antiviral responses and redirecting cellular resources to viral processes. Several viruses from divergent families accomplish host shutoff through RNA degradation by endoribonucleases. However, viruses also need to ensure expression of their own genes. The influenza A virus endoribonuclease PA-X solves this problem by sparing viral mRNAs and some host RNAs necessary for viral replication. To understand how PA-X distinguishes between RNAs, we characterized PA-X cut sites transcriptome-wide using 5' rapid amplification of complementary DNA ends coupled to high-throughput sequencing. This analysis, along with RNA structure predictions and validation experiments using reporters, shows that PA-Xs from multiple influenza strains preferentially cleave RNAs at GCUG tetramers in hairpin loops. Importantly, GCUG tetramers are enriched in the human but not the influenza transcriptome. Moreover, optimal PA-X cut sites inserted in the influenza A virus genome are quickly selected against during viral replication in cells. This finding suggests that PA-X evolved these cleavage characteristics to preferentially target host over viral mRNAs in a manner reminiscent of cellular self versus non-self discrimination.
Affiliation(s)
- Lea Gaucherand
- Program in Molecular Microbiology, Tufts University Graduate School of Biomedical Sciences, Boston, MA, USA
- Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, MA, USA
- Architecture et Réactivité de l'ARN, Institut de Biologie Moléculaire et Cellulaire du CNRS, Université de Strasbourg, Strasbourg, France
| | - Amrita Iyer
- Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, MA, USA
| | - Isabel Gilabert
- Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, MA, USA
- Faculty of Experimental Sciences, Universidad Francisco de Vitoria, Madrid, Spain
| | - Chris H Rycroft
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Mathematics, University of Wisconsin-Madison, Madison, WI, USA
| | - Marta M Gaglia
- Program in Molecular Microbiology, Tufts University Graduate School of Biomedical Sciences, Boston, MA, USA
- Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, MA, USA
- Institute for Molecular Virology and Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, USA
29
Zhou T, Dai N, Li S, Ward M, Mathews DH, Huang L. RNA design via structure-aware multifrontier ensemble optimization. Bioinformatics 2023; 39:i563-i571. [PMID: 37387188 DOI: 10.1093/bioinformatics/btad252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: RNA design is the search for a sequence or set of sequences that will fold into a desired structure, also known as the inverse problem of RNA folding. However, the sequences designed by existing algorithms often suffer from low ensemble stability, which worsens for long sequence design. Additionally, for many methods only a small number of sequences satisfying the MFE criterion can be found by each run of design. These drawbacks limit their use cases. RESULTS: We propose an innovative optimization paradigm, SAMFEO, which optimizes ensemble objectives (equilibrium probability or ensemble defect) by iterative search and yields a very large number of successfully designed RNA sequences as byproducts. We develop a search method which leverages structure-level and ensemble-level information at different stages of the optimization: initialization, sampling, mutation, and updating. Our work, while being less complicated than others, is the first algorithm that is able to design thousands of RNA sequences for the puzzles from the Eterna100 benchmark. In addition, our algorithm solves the most Eterna100 puzzles among all the general optimization-based methods in our study. The only baseline solving more puzzles than our work is dependent on handcrafted heuristics designed for a specific folding model. Surprisingly, our approach shows superiority on designing long sequences for structures adapted from the database of 16S Ribosomal RNAs. AVAILABILITY AND IMPLEMENTATION: Our source code and the data used in this article are available at https://github.com/shanry/SAMFEO.
Affiliation(s)
- Tianshuo Zhou
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, United States
| | - Ning Dai
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, United States
| | - Sizhen Li
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, United States
| | - Max Ward
- Department of Computer Science and Software Engineering, The University of Western Australia, Perth, Australia
| | - David H Mathews
- Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY 14642, United States
- Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, United States
- Department of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, United States
| | - Liang Huang
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, United States
30
Chan YC, Kienle E, Oti M, Di Liddo A, Mendez-Lago M, Aschauer DF, Peter M, Pagani M, Arnold C, Vonderheit A, Schön C, Kreuz S, Stark A, Rumpel S. An unbiased AAV-STARR-seq screen revealing the enhancer activity map of genomic regions in the mouse brain in vivo. Sci Rep 2023; 13:6745. [PMID: 37185990 PMCID: PMC10130037 DOI: 10.1038/s41598-023-33448-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 04/12/2023] [Indexed: 05/17/2023]
Abstract
Enhancers are important cis-regulatory elements controlling cell-type specific expression patterns of genes. Furthermore, combinations of enhancers and minimal promoters are utilized to construct small, artificial promoters for gene delivery vectors. Large-scale functional screening methodology to construct genomic maps of enhancer activities has been successfully established in cultured cell lines, however, not yet applied to terminally differentiated cells and tissues in a living animal. Here, we transposed the Self-Transcribing Active Regulatory Region Sequencing (STARR-seq) technique to the mouse brain using adeno-associated-viruses (AAV) for the delivery of a highly complex screening library tiling entire genomic regions and covering in total 3 Mb of the mouse genome. We identified 483 sequences with enhancer activity, including sequences that were not predicted by DNA accessibility or histone marks. Characterizing the expression patterns of fluorescent reporters controlled by nine candidate sequences, we observed differential expression patterns also in sparse cell types. Together, our study provides an entry point for the unbiased study of enhancer activities in organisms during health and disease.
Affiliation(s)
- Ya-Chien Chan
- Institute of Physiology, Focus Program Translational Neurosciences, University Medical Center, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Eike Kienle
- Institute of Physiology, Focus Program Translational Neurosciences, University Medical Center, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Martin Oti
- Institute of Molecular Biology GmbH (IMB), Mainz, Germany
- Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an Der Riß, Germany
| | | | | | - Dominik F Aschauer
- Institute of Physiology, Focus Program Translational Neurosciences, University Medical Center, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Manuel Peter
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
- Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Michaela Pagani
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
| | - Cosmas Arnold
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
- CeMM Research Center for Molecular Medicine, Austrian Academy of Sciences, Vienna, Austria
| | | | - Christian Schön
- Research Beyond Borders, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an Der Riß, Germany
| | - Sebastian Kreuz
- Research Beyond Borders, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an Der Riß, Germany
| | - Alexander Stark
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
- Medical University of Vienna, Vienna BioCenter (VBC), 1030, Vienna, Austria
| | - Simon Rumpel
- Institute of Physiology, Focus Program Translational Neurosciences, University Medical Center, Johannes Gutenberg University Mainz, Mainz, Germany
31
Qiu X. Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction. PLoS Comput Biol 2023; 19:e1011047. [PMID: 37068100 PMCID: PMC10138783 DOI: 10.1371/journal.pcbi.1011047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2023] [Revised: 04/27/2023] [Accepted: 03/25/2023] [Indexed: 04/18/2023]
Abstract
Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved performance far superior to that of traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacities and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as the sequence similarity decreases. The same trends are observed in several recent DL and machine learning models. An inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantitate how generalizability depends on sequence and structure identity scores via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice, and various pathways for future advances are discussed.
Affiliation(s)
- Xiangyun Qiu
- Department of Physics, George Washington University, Washington DC, United States of America
32
Krüger A, Watkins AM, Wellington-Oguri R, Romano J, Kofman C, DeFoe A, Kim Y, Anderson-Lee J, Fisker E, Townley J, d'Aquino AE, Das R, Jewett MC. Community science designed ribosomes with beneficial phenotypes. Nat Commun 2023; 14:961. [PMID: 36810740 PMCID: PMC9944925 DOI: 10.1038/s41467-023-35827-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/30/2022] [Accepted: 01/04/2023] [Indexed: 02/23/2023]
Abstract
Functional design of ribosomes with mutant ribosomal RNA (rRNA) can expand opportunities for understanding molecular translation, building cells from the bottom-up, and engineering ribosomes with altered capabilities. However, such efforts are hampered by cell viability constraints, an enormous combinatorial sequence space, and limitations on large-scale, 3D design of RNA structures and functions. To address these challenges, we develop an integrated community science and experimental screening approach for rational design of ribosomes. This approach couples Eterna, an online video game that crowdsources RNA sequence design to community scientists in the form of puzzles, with in vitro ribosome synthesis, assembly, and translation in multiple design-build-test-learn cycles. We apply our framework to discover mutant rRNA sequences that improve protein synthesis in vitro and cell growth in vivo, relative to wild type ribosomes, under diverse environmental conditions. This work provides insights into rRNA sequence-function relationships and has implications for synthetic biology.
Affiliation(s)
- Antje Krüger
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA; Resilience US Inc, 9310 Athena Circle, La Jolla, CA, 92037, USA
| | - Andrew M Watkins
- Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA; Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
| | | | - Jonathan Romano
- Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA; Eterna Massive Open Laboratory, Stanford, CA, 94305, USA; Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, 14260, USA
| | - Camila Kofman
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA
| | - Alysse DeFoe
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA
| | - Yejun Kim
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA
| | | | - Eli Fisker
- Eterna Massive Open Laboratory, Stanford, CA, 94305, USA
| | - Jill Townley
- Eterna Massive Open Laboratory, Stanford, CA, 94305, USA
| | | | - Anne E d'Aquino
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA
| | - Rhiju Das
- Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA; Howard Hughes Medical Institute, Stanford University, Stanford, CA, 94305, USA
| | - Michael C Jewett
- Department of Chemical and Biological Engineering, Chemistry of Life Processes Institute, and Center for Synthetic Biology, Northwestern University, Evanston, IL, 60208, USA; Robert H. Lurie Comprehensive Cancer Center and Simpson Querrey Institute, Northwestern University, Chicago, IL, 60611, USA
33
Zhao Q, Mao Q, Zhao Z, Yuan W, He Q, Sun Q, Yao Y, Fan X. RNA independent fragment partition method based on deep learning for RNA secondary structure prediction. Sci Rep 2023; 13:2861. [PMID: 36801945 PMCID: PMC9938198 DOI: 10.1038/s41598-023-30124-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/06/2022] [Accepted: 02/16/2023] [Indexed: 02/19/2023]
Abstract
The non-coding RNA secondary structure largely determines its function. Hence, accuracy in structure acquisition is of great importance. Currently, this acquisition primarily relies on various computational methods. The prediction of the structures of long RNA sequences with high precision and reasonable computational cost remains challenging. Here, we propose a deep learning model, RNA-par, which could partition an RNA sequence into several independent fragments (i-fragments) based on its exterior loops. Each i-fragment secondary structure predicted individually could be further assembled to acquire the complete RNA secondary structure. In the examination of our independent test set, the average length of the predicted i-fragments was 453 nt, which was considerably shorter than that of complete RNA sequences (848 nt). The accuracy of the assembled structures was higher than that of the structures predicted directly using the state-of-the-art RNA secondary structure prediction methods. This proposed model could serve as a preprocessing step for RNA secondary structure prediction for enhancing the predictive performance (especially for long RNA sequences) and reducing the computational cost. In the future, predicting the secondary structure of long-sequence RNA with high accuracy can be enabled by developing a framework combining RNA-par with various existing RNA secondary structure prediction algorithms. Our models, test codes and test data are provided at https://github.com/mianfei71/RNAPar .
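To make the notion of an independent fragment concrete, the sketch below splits a known dot-bracket structure at its exterior loop, returning one span per top-level helix domain. It assumes the structure is already given and pseudoknot-free; RNA-par itself predicts the partition from sequence with a deep learning model, which this sketch does not attempt. The function name `split_at_exterior_loop` and the example structure are made up for illustration.

```python
def split_at_exterior_loop(structure):
    """Return (start, end) spans of top-level domains in a dot-bracket string.

    Spans are 0-based and end-exclusive. Unpaired bases in the exterior
    loop fall outside every span.
    """
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(structure):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level domain opens here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
            if depth == 0:
                spans.append((start, i + 1))  # the domain closes here
    if depth != 0:
        raise ValueError("unbalanced structure")
    return spans


# Made-up structure with two domains separated by exterior-loop bases.
toy = "..((..))..(((...)))."
for start, end in split_at_exterior_loop(toy):
    print(start, end, toy[start:end])
```

Because no base pair crosses a span boundary, each span can be folded on its own and the results concatenated, which is the property the abstract relies on when assembling fragment predictions.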
Affiliation(s)
- Qi Zhao
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, 110169, Liaoning, China
| | - Qian Mao
- College of Light Industry, Liaoning University, Shenyang, 110036, Liaoning, China
| | - Zheng Zhao
- College of Artificial Intelligence, Dalian Maritime University, Dalian, 116026, Liaoning, China
| | - Wenxuan Yuan
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, 110169, Liaoning, China
| | - Qiang He
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, 110169, Liaoning, China
| | - Qixuan Sun
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, 110169, Liaoning, China
| | - Yudong Yao
- Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
| | - Xiaoya Fan
- School of Software, Dalian University of Technology, Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, 116620, Liaoning, China
34
Zhang H, Zhang L, Liu K, Li S, Mathews DH, Huang L. Linear-Time Algorithms for RNA Structure Prediction. Methods Mol Biol 2023; 2586:15-34. [PMID: 36705896 DOI: 10.1007/978-1-0716-2768-6_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
RNA secondary structure prediction is widely used to understand RNA function. Existing dynamic programming-based algorithms, both the classical minimum free energy (MFE) methods and partition function methods, suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Inspired by incremental parsing for context-free grammars in computational linguistics, we designed linear-time heuristic algorithms, LinearFold and LinearPartition, to approximate the MFE structure, partition function and base pairing probabilities. These programs are orders of magnitude faster than Vienna RNAfold and CONTRAfold on long sequences. More interestingly, LinearFold and LinearPartition lead to more accurate predictions on the longest sequence families for which the structures are well established (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500 + nucleotides apart). This chapter provides protocols for using LinearFold and LinearPartition for secondary structure prediction.
Affiliation(s)
- He Zhang
- Baidu Research USA, Sunnyvale, CA, USA; School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Liang Zhang
- Baidu Research USA, Sunnyvale, CA, USA; School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Kaibo Liu
- Baidu Research USA, Sunnyvale, CA, USA
| | - Sizhen Li
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - David H Mathews
- Dept. of Biochemistry & Biophysics, Center for RNA Biology, Rochester, NY, USA; Dept. of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
| | - Liang Huang
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
35
Binet T, Avalle B, Dávila Felipe M, Maffucci I. AptaMat: a matrix-based algorithm to compare single-stranded oligonucleotides secondary structures. Bioinformatics 2022; 39:6849515. [PMID: 36440922 PMCID: PMC9805580 DOI: 10.1093/bioinformatics/btac752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2022] [Revised: 11/14/2022] [Accepted: 11/24/2022] [Indexed: 11/30/2022]
Abstract
MOTIVATION Comparing the secondary structures of single-stranded nucleic acids (ssNAs) is fundamental when investigating their function and evolution and when predicting the effect of mutations on their structures. Many comparison metrics exist, but they are either too elaborate or not sensitive enough to distinguish close ssNA structures. RESULTS In this context, we developed AptaMat, a simple and sensitive algorithm for comparing ssNA secondary structures, based on matrices representing the secondary structures and a metric built upon the Manhattan distance in the plane. We applied AptaMat to several examples and compared the results to those obtained with the most frequently used metrics, namely the Hamming distance and RNAdistance, and with a recently developed image-based approach. We showed that AptaMat is able to discriminate between similar sequences, outperforming all the other metrics considered here. In addition, we showed that AptaMat was able to correctly classify 14 RFAM families within a clustering procedure. AVAILABILITY AND IMPLEMENTATION The Python code for AptaMat is available at https://github.com/GEC-git/AptaMat.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
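For readers unfamiliar with the baseline metrics named in this abstract, the sketch below shows the Hamming distance between two dot-bracket strings, together with a helper that lists the base pairs each string encodes. It is not AptaMat's metric; the function names and the example structures are illustrative assumptions, and pseudoknot-free dot-bracket notation is assumed.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a pseudoknot-free dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def hamming_distance(a, b):
    """Count positions at which two equal-length dot-bracket strings differ."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical example: the second structure loses the outermost base pair.
reference = "((((....))))"
shifted = ".(((....)))."
print(hamming_distance(reference, shifted))
print(sorted(base_pairs(reference) - base_pairs(shifted)))
```

The Hamming distance only counts differing characters, so it cannot tell how far apart two differing pairs are in the structure; that is the kind of insensitivity the abstract says a matrix-based metric is meant to address.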
Affiliation(s)
- Thomas Binet
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu, CS 60 319 - 60 203, Compiègne Cedex, France
| | - Bérangère Avalle
- Université de technologie de Compiègne, UPJV, CNRS, Enzyme and Cell Engineering, Centre de recherche Royallieu, CS 60 319 - 60 203, Compiègne Cedex, France
|
36
|
Nef C, Madoui MA, Pelletier É, Bowler C. Whole-genome scanning reveals environmental selection mechanisms that shape diversity in populations of the epipelagic diatom Chaetoceros. PLoS Biol 2022; 20:e3001893. [PMID: 36441816 PMCID: PMC9731442 DOI: 10.1371/journal.pbio.3001893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 12/08/2022] [Accepted: 10/27/2022] [Indexed: 11/30/2022] Open
Abstract
Diatoms form a diverse and abundant group of photosynthetic protists that are essential players in marine ecosystems. However, the microevolutionary structure of their populations remains poorly understood, particularly in polar regions. Exploring how closely related diatoms adapt to different environments is essential given their short generation times, which may allow rapid adaptations, and their prevalence in marine regions dramatically impacted by climate change, such as the Arctic and Southern Oceans. Here, we address genetic diversity patterns in Chaetoceros, the most abundant diatom genus and one of the most diverse, using 11 metagenome-assembled genomes (MAGs) reconstructed from Tara Oceans metagenomes. Genome-resolved metagenomics on these MAGs confirmed a prevalent distribution of Chaetoceros in the Arctic Ocean with lower dispersal in the Pacific and Southern Oceans as well as in the Mediterranean Sea. Single-nucleotide variants identified within the different MAG populations allowed us to draw a landscape of Chaetoceros genetic diversity and revealed an elevated genetic structure in some Arctic Ocean populations. Gene flow patterns of closely related Chaetoceros populations seemed to correlate with distinct abiotic factors rather than with geographic distance. We found clear positive selection of genes involved in nutrient availability responses, in particular for iron (e.g., ISIP2a, flavodoxin), silicate, and phosphate (e.g., polyamine synthase), that were further supported by analysis of Chaetoceros transcriptomes. Altogether, these results highlight the importance of environmental selection in shaping diatom diversity patterns and provide new insights into their metapopulation genomics through the integration of metagenomic and environmental data.
Affiliation(s)
- Charlotte Nef
- Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, PSL Université Paris, Paris, France
- Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans, Paris, France
| | - Mohammed-Amin Madoui
- Service d’Etude des Prions et des Infections Atypiques (SEPIA), Institut François Jacob, Commissariat à l’Energie Atomique et aux Energies Alternatives (CEA), Université Paris Saclay, Fontenay-aux-Roses, France
- Équipe Écologie Évolutive, UMR CNRS 6282 BioGéoSciences, Université de Bourgogne Franche-Comté, Dijon, 21000, France
| | - Éric Pelletier
- Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans, Paris, France
- Metabolic Genomics, Genoscope, Institut de Biologie François-Jacob, CEA, CNRS, Université Evry, Université Paris Saclay, Evry, France
| | - Chris Bowler
- Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, PSL Université Paris, Paris, France
- Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans, Paris, France
|
37
|
Zhang H, Li S, Zhang L, Mathews D, Huang L. LazySampling and LinearSampling: fast stochastic sampling of RNA secondary structure with applications to SARS-CoV-2. Nucleic Acids Res 2022; 51:e7. [PMID: 36401871 PMCID: PMC9881153 DOI: 10.1093/nar/gkac1029] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/27/2022] [Revised: 09/22/2022] [Accepted: 10/21/2022] [Indexed: 11/21/2022] Open
Abstract
Many RNAs fold into multiple structures at equilibrium, and there is a need to sample these structures according to their probabilities in the ensemble. The conventional sampling algorithm suffers from two limitations: (i) the sampling phase is slow due to many repeated calculations; and (ii) the end-to-end runtime scales cubically with the sequence length. These issues make it difficult to be applied to long RNAs, such as the full genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To address these problems, we devise a new sampling algorithm, LazySampling, which eliminates redundant work via on-demand caching. Based on LazySampling, we further derive LinearSampling, an end-to-end linear time sampling algorithm. Benchmarking on nine diverse RNA families, the sampled structures from LinearSampling correlate better with the well-established secondary structures than Vienna RNAsubopt and RNAplfold. More importantly, LinearSampling is orders of magnitude faster than standard tools, being 428× faster (72 s versus 8.6 h) than RNAsubopt on the full genome of SARS-CoV-2 (29 903 nt). The resulting sample landscape correlates well with the experimentally guided secondary structure models, and is closer to the alternative conformations revealed by experimentally driven analysis. Finally, LinearSampling finds 23 regions of 15 nt with high accessibilities in the SARS-CoV-2 genome, which are potential targets for COVID-19 diagnostics and therapeutics.
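The two-phase idea described in this abstract (a bottom-up table of ensemble weights, then top-down sampling that reuses cached values) can be illustrated with a toy model. The sketch below is not LazySampling or LinearSampling: it assumes every pseudoknot-free structure has equal weight, so the "partition function" is simply a count, and it uses Python's `lru_cache` to stand in for on-demand caching. The names `make_sampler`, `count`, `sample` and the constant `MIN_LOOP` are hypothetical.

```python
import random
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # minimum number of unpaired bases enclosed by a hairpin


def make_sampler(seq):
    """Build count and sample functions for one sequence (toy model: all structures weigh 1)."""

    @lru_cache(maxsize=None)  # each span is computed only when requested, and only once
    def count(i, j):
        """Number of pseudoknot-free structures on seq[i..j], inclusive."""
        if j - i < 1:
            return 1  # empty or single-base span: only the unpaired structure
        total = count(i + 1, j)  # case 1: base i is unpaired
        for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: base i pairs with base k
            if (seq[i], seq[k]) in PAIRS:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    def sample(rng):
        """Draw one structure uniformly at random, top-down, as a dot-bracket string."""
        out = ["."] * len(seq)
        todo = [(0, len(seq) - 1)]
        while todo:
            i, j = todo.pop()
            if j - i < 1:
                continue
            r = rng.randrange(count(i, j))
            if r < count(i + 1, j):  # chose "base i unpaired"
                todo.append((i + 1, j))
                continue
            r -= count(i + 1, j)
            for k in range(i + MIN_LOOP + 1, j + 1):
                if (seq[i], seq[k]) not in PAIRS:
                    continue
                weight = count(i + 1, k - 1) * count(k + 1, j)
                if r < weight:  # chose "base i pairs with base k"
                    out[i], out[k] = "(", ")"
                    todo += [(i + 1, k - 1), (k + 1, j)]
                    break
                r -= weight
        return "".join(out)

    return count, sample


count, sample = make_sampler("GGGAAACCC")
rng = random.Random(0)
print(count(0, 8), [sample(rng) for _ in range(3)])
```

In a thermodynamic sampler the counts would be replaced by Boltzmann-weighted partition function values, so that structures are drawn with probability proportional to exp(-ΔG/RT) rather than uniformly.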
Affiliation(s)
- He Zhang
- Baidu Research, Sunnyvale, CA, USA; School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Sizhen Li
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - Liang Zhang
- School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR, USA
| | - David H Mathews
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642, USA; Department of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
|
38
|
Fukunaga T, Hamada M. LinAliFold and CentroidLinAliFold: fast RNA consensus secondary structure prediction for aligned sequences using beam search methods. Bioinformatics Advances 2022; 2:vbac078. [PMID: 36699418 PMCID: PMC9710674 DOI: 10.1093/bioadv/vbac078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 10/13/2022] [Accepted: 10/21/2022] [Indexed: 11/05/2022]
Abstract
Motivation RNA consensus secondary structure prediction from aligned sequences is a powerful approach for improving the secondary structure prediction accuracy. However, because the computational complexities of conventional prediction tools scale with the cube of the alignment lengths, their application to long RNA sequences, such as viral RNAs or long non-coding RNAs, requires significant computational time. Results In this study, we developed LinAliFold and CentroidLinAliFold, fast RNA consensus secondary structure prediction tools based on minimum free energy and maximum expected accuracy principles, respectively. We achieved software acceleration using beam search methods that were successfully used for fast secondary structure prediction from a single RNA sequence. Benchmark analyses showed that LinAliFold and CentroidLinAliFold were much faster than the existing methods while preserving the prediction accuracy. As an empirical application, we predicted the consensus secondary structure of coronaviruses with approximately 30 000 nt in 5 and 79 min by LinAliFold and CentroidLinAliFold, respectively. We confirmed that the predicted consensus secondary structure of coronaviruses was consistent with the experimental results. Availability and implementation The source codes of LinAliFold and CentroidLinAliFold are freely available at https://github.com/fukunagatsu/LinAliFold-CentroidLinAliFold. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 1698555, Japan; Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo 1698555, Japan
|
39
|
Opuu V, Merleau NSC, Messow V, Smerlak M. RAFFT: Efficient prediction of RNA folding pathways using the fast Fourier transform. PLoS Comput Biol 2022; 18:e1010448. [PMID: 36026505 PMCID: PMC9455880 DOI: 10.1371/journal.pcbi.1010448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 09/08/2022] [Accepted: 07/28/2022] [Indexed: 11/18/2022] Open
Abstract
We propose a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. This heuristic is inspired by the kinetic partitioning mechanism, by which molecules follow alternative folding pathways to their native structure, some much faster than others. Similarly, our algorithm RAFFT starts by generating an ensemble of concurrent folding pathways ending in multiple metastable structures, which is in contrast with traditional thermodynamic approaches that find single structures with minimal free energies. When we constrained the algorithm to predict only 50 structures per sequence, near-native structures were found for RNA molecules of length ≤ 200 nucleotides. Our heuristic has been tested on the coronavirus frameshifting stimulation element (CFSE): an ensemble of 68 distinct structures allowed us to produce complete folding kinetic trajectories, whereas known methods require evaluating millions of sub-optimal structures to achieve this result. Thanks to the fast Fourier transform on which RAFFT (RNA folding Algorithm with Fast Fourier Transform) is based, these computations are efficient, with complexity O(L^2 log L).
Affiliation(s)
- Vaitea Opuu
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
| | | | - Vincent Messow
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
| | - Matteo Smerlak
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
|
40
|
Fei Y, Zhang H, Wang Y, Liu Z, Liu Y. LTPConstraint: a transfer learning based end-to-end method for RNA secondary structure prediction. BMC Bioinformatics 2022; 23:354. [PMID: 35999499 PMCID: PMC9396797 DOI: 10.1186/s12859-022-04847-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Accepted: 07/18/2022] [Indexed: 11/26/2022] Open
Abstract
Background RNA secondary structure is very important for deciphering cell activity and disease occurrence. The first methods used to determine this structure were biological experiments, but they are too expensive to be widely adopted. Computational methods then emerged, offering good efficiency and low cost. However, the accuracy of computational methods is not satisfactory. Many machine learning methods have also been applied to this area, but the accuracy has not improved significantly. Deep learning has matured and achieved great success in many areas such as computer vision and natural language processing. It uses neural networks, which are functional and versatile structures, but their performance is highly correlated with the quantity and quality of the data. At present, there is no model with high accuracy, low data dependence and high convenience for predicting RNA secondary structure. Results This paper designs a neural network called LTPConstraint to predict RNA secondary structure. The network is based on several network structures, such as bidirectional LSTM, Transformer and generator. It also uses transfer learning to train the model so that the data dependence can be reduced. Conclusions LTPConstraint has achieved high accuracy in RNA secondary structure prediction. Compared with previous methods, the accuracy improves markedly both for structures with pseudoknots and for structures without pseudoknots. At the same time, LTPConstraint is easy to operate and returns results very quickly.
Affiliation(s)
- Yinchao Fei
- College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China
| | - Hao Zhang
- College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China
| | - Yili Wang
- College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China
| | - Zhen Liu
- Graduate School of Engineering, Nagasaki Institute of Applied Science, Nagasaki, Japan
| | - Yuanning Liu
- College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China
|
41
|
Bugnon LA, Edera AA, Prochetto S, Gerard M, Raad J, Fenoy E, Rubiolo M, Chorostecki U, Gabaldón T, Ariel F, Di Persia LE, Milone DH, Stegmayer G. Secondary structure prediction of long noncoding RNA: review and experimental comparison of existing approaches. Brief Bioinform 2022; 23:6606044. [PMID: 35692094 DOI: 10.1093/bib/bbac205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 05/02/2022] [Accepted: 05/04/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION In contrast to messenger RNAs, the function of the wide range of existing long noncoding RNAs (lncRNAs) largely depends on their structure, which determines interactions with partner molecules. Thus, the determination or prediction of the secondary structure of lncRNAs is critical to uncover their function. Classical approaches for predicting RNA secondary structure have been based on dynamic programming and thermodynamic calculations. In the last 4 years, a growing number of machine learning (ML)-based models, including deep learning (DL), have achieved breakthrough performance in structure prediction of biomolecules such as proteins and have outperformed classical methods in short transcripts folding. Nevertheless, the accurate prediction for lncRNA still remains far from being effectively solved. Notably, the myriad of new proposals has not been systematically and experimentally evaluated. RESULTS In this work, we compare the performance of the classical methods as well as the most recently proposed approaches for secondary structure prediction of RNA sequences using a unified and consistent experimental setup. We use the publicly available structural profiles for 3023 yeast RNA sequences, and a novel benchmark of well-characterized lncRNA structures from different species. Moreover, we propose a novel metric to assess the predictive performance of methods, exclusively based on the chemical probing data commonly used for profiling RNA structures, avoiding any potential bias incorporated by computational predictions when using dot-bracket references. Our results provide a comprehensive comparative assessment of existing methodologies, and a novel and public benchmark resource to aid in the development and comparison of future approaches. AVAILABILITY Full source code and benchmark datasets are available at: https://github.com/sinc-lab/lncRNA-folding. CONTACT lbugnon@sinc.unl.edu.ar.
Affiliation(s)
- L A Bugnon
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - A A Edera
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - S Prochetto
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina; IAL, CONICET, Ciudad Universitaria UNL, (3000) Santa Fe, Argentina
| | - M Gerard
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - J Raad
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - E Fenoy
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - M Rubiolo
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - U Chorostecki
- Barcelona Supercomputing Center (BSC-CNS), Institute of Research in Biomedicine (IRB), Spain
| | - T Gabaldón
- Barcelona Supercomputing Center (BSC-CNS), Institute of Research in Biomedicine (IRB), Spain; Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain; Centro de Investigación Biomédica En Red de Enfermedades Infecciosas (CIBERINFEC), Barcelona, Spain
| | - F Ariel
- IAL, CONICET, Ciudad Universitaria UNL, (3000) Santa Fe, Argentina
| | - L E Di Persia
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - D H Milone
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - G Stegmayer
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
|
42
|
Zuber J, Schroeder SJ, Sun H, Turner DH, Mathews DH. Nearest neighbor rules for RNA helix folding thermodynamics: improved end effects. Nucleic Acids Res 2022; 50:5251-5262. [PMID: 35524574 PMCID: PMC9122537 DOI: 10.1093/nar/gkac261] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/23/2021] [Revised: 03/29/2022] [Accepted: 04/08/2022] [Indexed: 12/26/2022] Open
Abstract
Nearest neighbor parameters for estimating the folding stability of RNA secondary structures are in widespread use. For helices, current parameters penalize terminal AU base pairs relative to terminal GC base pairs. We curated an expanded database of helix stabilities determined by optical melting experiments. Analysis of the updated database shows that terminal penalties depend on the sequence identity of the adjacent penultimate base pair. New nearest neighbor parameters that include this additional sequence dependence accurately predict the measured values of 271 helices in an updated database with a correlation coefficient of 0.982. This refined understanding of helix ends facilitates fitting terms for base pair stacks with GU pairs. Prior parameter sets treated 5′GGUC3′ paired to 3′CUGG5′ separately from other 5′GU3′/3′UG5′ stacks. The improved understanding of helix end stability, however, makes the separate treatment unnecessary. Introduction of the additional terms was tested with three optical melting experiments. The average absolute difference between measured and predicted free energy changes at 37°C for these three duplexes containing terminal adjacent AU and GU pairs improved from 1.38 to 0.27 kcal/mol. This confirms the need for the additional sequence dependence in the model.
Affiliation(s)
- Jeffrey Zuber
- Alnylam Pharmaceuticals, Inc., Cambridge, MA 02142, USA
| | - Susan J Schroeder
- Department of Chemistry and Biochemistry, and Department of Microbiology and Plant Biology, University of Oklahoma, Norman, OK 73019, USA
| | - Hongying Sun
- Department of Biochemistry & Biophysics, University of Rochester, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester, Rochester, NY 14642, USA
| | - Douglas H Turner
- Center for RNA Biology, University of Rochester, Rochester, NY 14642, USA; Department of Chemistry, University of Rochester, Rochester, NY 14627, USA
| | - David H Mathews
- Department of Biochemistry & Biophysics, University of Rochester, Rochester, NY 14642, USA; Center for RNA Biology, University of Rochester, Rochester, NY 14642, USA; Department of Biostatistics & Computational Biology, University of Rochester, Rochester, NY 14642, USA
|
43
|
Gray M, Chester S, Jabbari H. KnotAli: informed energy minimization through the use of evolutionary information. BMC Bioinformatics 2022; 23:159. [PMID: 35505276 PMCID: PMC9063079 DOI: 10.1186/s12859-022-04673-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2021] [Accepted: 04/05/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Improving the prediction of structures, especially those containing pseudoknots (structures with crossing base pairs) is an ongoing challenge. Homology-based methods utilize structural similarities within a family to predict the structure. However, their prediction is limited to the consensus structure, and by the quality of the alignment. Minimum free energy (MFE) based methods, on the other hand, do not rely on familial information and can predict structures of novel RNA molecules. Their prediction normally suffers from inaccuracies due to their underlying energy parameters. RESULTS We present a new method for prediction of RNA pseudoknotted secondary structures that combines the strengths of MFE prediction and alignment-based methods. KnotAli takes a multiple RNA sequence alignment as input and uses covariation and thermodynamic energy minimization to predict possibly pseudoknotted secondary structures for each individual sequence in the alignment. We compared KnotAli's performance to that of three other alignment-based programs, two that can handle pseudoknotted structures and one control, on a large data set of 3034 RNA sequences with varying lengths and levels of sequence conservation from 10 families with pseudoknotted and pseudoknot-free reference structures. We produced sequence alignments for each family using two well-known sequence aligners (MUSCLE and MAFFT). CONCLUSIONS We found KnotAli's performance to be superior in 6 of the 10 families for MUSCLE and 7 of the 10 for MAFFT. While both KnotAli and Cacofold use background noise correction strategies, we found KnotAli's predictions to be less dependent on the alignment quality. KnotAli can be found online at the Zenodo image: https://doi.org/10.5281/zenodo.5794719.
Affiliation(s)
- Mateo Gray
- Department of Computer Science, University of Victoria, Victoria, Canada
| | - Sean Chester
- Department of Computer Science, University of Victoria, Victoria, Canada
| | - Hosna Jabbari
- Department of Computer Science, University of Victoria, Victoria, Canada; Institute on Aging and Lifelong Health, University of Victoria, Victoria, Canada
|
44
|
RNA folding using quantum computers. PLoS Comput Biol 2022; 18:e1010032. [PMID: 35404931 PMCID: PMC9022793 DOI: 10.1371/journal.pcbi.1010032] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/01/2021] [Revised: 04/21/2022] [Accepted: 03/18/2022] [Indexed: 11/19/2022] Open
Abstract
The 3-dimensional fold of an RNA molecule is largely determined by patterns of intramolecular hydrogen bonds between bases. Predicting the base pairing network from the sequence, also referred to as RNA secondary structure prediction or RNA folding, is a nondeterministic polynomial-time (NP)-complete computational problem. The structure of the molecule is strongly predictive of its functions and biochemical properties, and therefore the ability to accurately predict the structure is a crucial tool for biochemists. Many methods have been proposed to efficiently sample possible secondary structure patterns. Classic approaches employ dynamic programming, and recent studies have explored approaches inspired by evolutionary and machine learning algorithms. This work demonstrates leveraging quantum computing hardware to predict the secondary structure of RNA. A Hamiltonian written in the form of a Binary Quadratic Model (BQM) is derived to drive the system toward maximizing the number of consecutive base pairs while jointly maximizing the average length of the stems. A Quantum Annealer (QA) is compared to a Replica Exchange Monte Carlo (REMC) algorithm programmed with the same objective function, with the QA being shown to be highly competitive at rapidly identifying low energy solutions. The method proposed in this study was compared to three algorithms from literature and, despite its simplicity, was found to be competitive on a test set containing known structures with pseudoknots. The recent FDA approval of mRNA-based vaccines has increased public interest in synthetically designed RNA molecules. RNA molecules fold into complex secondary structures which determine their molecular properties and in part their efficacy. Determining the folded structure of an RNA molecule is a computationally challenging task with exponential scaling that is intractable to solve exactly, and therefore approximate methods are used. 
Quantum computing technology offers a new approach to finding approximate solutions to problems with exponential scaling. We formulate a simplistic, yet effective, model of RNA folding that can easily be mapped to quantum computers and we show that currently available quantum computing hardware is competitive with classical methods.
|
45
|
Hess JM, Jannen WK, Aalberts DP. The four mRNA bases have quite different (un)folding free energies, applications to RNA splicing and translation initiation with BindOligoNet. J Mol Biol 2022; 434:167578. [DOI: 10.1016/j.jmb.2022.167578] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/24/2021] [Revised: 03/31/2022] [Accepted: 04/01/2022] [Indexed: 12/12/2022]
|
46
|
Tagashira M, Asai K. ConsAlifold: considering RNA structural alignments improves prediction accuracy of RNA consensus secondary structures. Bioinformatics 2022; 38:710-719. [PMID: 34694364 DOI: 10.1093/bioinformatics/btab738] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/22/2020] [Revised: 08/24/2021] [Accepted: 10/20/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION By detecting homology among RNAs, the probabilistic consideration of RNA structural alignments has improved the prediction accuracy of significant RNA prediction problems. Predicting an RNA consensus secondary structure from an RNA sequence alignment is a fundamental research objective because in the detection of conserved base-pairings among RNA homologs, predicting an RNA consensus secondary structure is more convenient than predicting an RNA structural alignment. RESULTS We developed and implemented ConsAlifold, a dynamic programming-based method that predicts the consensus secondary structure of an RNA sequence alignment. ConsAlifold considers RNA structural alignments. ConsAlifold achieves moderate running time and the best prediction accuracy of RNA consensus secondary structures among available prediction methods. AVAILABILITY AND IMPLEMENTATION ConsAlifold, data and Python scripts for generating both figures and tables are freely available at https://github.com/heartsh/consalifold. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Masaki Tagashira
- Department of Computational Biology and Medical Sciences, University of Tokyo, Chiba 277-8561, Japan; Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
| | - Kiyoshi Asai
- Department of Computational Biology and Medical Sciences, University of Tokyo, Chiba 277-8561, Japan; Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
|
47
|
Zhang H, Zhang L, Li S, Mathews DH, Huang L. LazySampling and LinearSampling: Fast Stochastic Sampling of RNA Secondary Structure with Applications to SARS-CoV-2. bioRxiv 2021:2020.12.29.424617. [PMID: 33398265 PMCID: PMC7781300 DOI: 10.1101/2020.12.29.424617] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
Many RNAs fold into multiple structures at equilibrium. The classical stochastic sampling algorithm can sample secondary structures according to their probabilities in the Boltzmann ensemble, and is widely used. However, this algorithm, consisting of a bottom-up partition function phase followed by a top-down sampling phase, suffers from three limitations: (a) the formulation and implementation of the sampling phase are unnecessarily complicated; (b) the sampling phase repeatedly recalculates many redundant recursions already done during the partition function phase; (c) the partition function runtime scales cubically with the sequence length. These issues prevent stochastic sampling from being used for very long RNAs such as the full genomes of SARS-CoV-2. To address these problems, we first adopt a hypergraph framework under which the sampling algorithm can be greatly simplified. We then present three sampling algorithms under this framework, among which the LazySampling algorithm is the fastest by eliminating redundant work in the sampling phase via on-demand caching. Based on LazySampling, we further replace the cubic-time partition function by a linear-time approximate one, and derive LinearSampling, an end-to-end linear-time sampling algorithm that is orders of magnitude faster than the standard one. For instance, LinearSampling is 176× faster (38.9 s vs. 1.9 h) than Vienna RNAsubopt on the full genome of Ebola virus (18,959 nt). More importantly, LinearSampling is the first RNA structure sampling algorithm to scale up to the full genome of SARS-CoV-2 without local window constraints, taking only 69.2 seconds on its reference sequence (29,903 nt). The resulting sample correlates well with the experimentally guided structures. On the SARS-CoV-2 genome, LinearSampling finds 23 regions of 15 nt with high accessibilities, which are potential targets for COVID-19 diagnostics and drug design. See code: https://github.com/LinearFold/LinearSampling.
|
48
|
Fu L, Cao Y, Wu J, Peng Q, Nie Q, Xie X. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res 2021; 50:e14. [PMID: 34792173 PMCID: PMC8860580 DOI: 10.1093/nar/gkab1074] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 04/30/2021] [Revised: 09/15/2021] [Accepted: 10/19/2021] [Indexed: 11/13/2022]
Abstract
For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run. Here, we propose a deep learning-based method, called UFold, for RNA secondary structure prediction, trained directly on annotated data and base-pairing rules. UFold proposes a novel image-like representation of RNA sequences, which can be efficiently processed by Fully Convolutional Networks (FCNs). We benchmark the performance of UFold on both within- and cross-family RNA datasets. It significantly outperforms previous methods on within-family datasets, while achieving a similar performance as the traditional methods when trained and tested on distinct RNA families. UFold is also able to predict pseudoknots accurately. Its prediction is fast with an inference time of about 160 ms per sequence up to 1500 bp in length. An online web server running UFold is available at https://ufold.ics.uci.edu. Code is available at https://github.com/uci-cbcl/UFold.
Affiliation(s)
- Laiyi Fu
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Department of Computer Science, University of California, Irvine, CA 92697, USA
- Yingxin Cao
- Department of Computer Science, University of California, Irvine, CA 92697, USA; Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA; NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, CA 92697, USA
- Jie Wu
- Department of Biological Chemistry, University of California, Irvine, CA 92697, USA
- Qinke Peng
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Qing Nie
- Department of Mathematics, University of California, Irvine, CA 92697, USA; Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA; NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, CA 92697, USA
- Xiaohui Xie
- Department of Computer Science, University of California, Irvine, CA 92697, USA
|
49
|
Pandemic Analytics: How Countries are Leveraging Big Data Analytics and Artificial Intelligence to Fight COVID-19? SN Computer Science 2021; 3:54. [PMID: 34778841 PMCID: PMC8577168 DOI: 10.1007/s42979-021-00923-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/07/2021] [Accepted: 10/04/2021] [Indexed: 12/23/2022]
Abstract
Emergence of coronavirus in December 2019 and its spread across the world in the following months has made it a global health concern. The uncertainty about its evolution, transmission and effect of SARS-CoV-2, has left the countries and their governments in a worrisome state. Ambiguity about the strategies that would work towards mitigating the impact of virus has prompted them to use data-driven methods. Several countries started applying big data and advanced analytics technology for management of the crisis. This study aims to understand how different nations have employed analytics to deal with COVID-19. This paper reviews various strategies employed by different governments and organizations across nations that use advanced analytics to tackle pandemic. In the current emergency of corona virus, there have been several measures that organizations have taken to mitigate its impact, thanks to the evolution of computing technology. Big data and analytical tools provide various solutions like detection of existing COVID-19 cases, prediction of future outbreak, anticipation of potential preventive and therapeutic agents, and assistance in informed decision-making. This review discusses the big data analytics and artificial intelligence approaches that policy makers, researchers, epidemiologists and private organizations have adopted. By examining the different ways and areas where data analytics has been utilized, this study provides the other nations with the progressive scheme to address the pandemic.
|
50
|
Zhang C, Forsdyke DR. Potential Achilles heels of SARS-CoV-2 are best displayed by the base order-dependent component of RNA folding energy. Comput Biol Chem 2021; 94:107570. [PMID: 34500325 PMCID: PMC8410225 DOI: 10.1016/j.compbiolchem.2021.107570] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Revised: 08/29/2021] [Accepted: 08/30/2021] [Indexed: 11/29/2022]
Abstract
The base order-dependent component of folding energy has revealed a highly conserved region in HIV-1 genomes that associates with RNA structure. This corresponds to a packaging signal that is recognized by the nucleocapsid domain of the Gag polyprotein. Long viewed as a potential HIV-1 "Achilles heel," the signal can be targeted by a new antiviral compound. Although SARS-CoV-2 differs in many respects from HIV-1, the same technology displays regions with a high base order-dependent folding energy component, which are also highly conserved. This indicates structural invariance (SI) sustained by natural selection. While the regions are often also protein-encoding (e.g. NSP3, ORF3a), we suggest that their nucleic acid level functions can be considered potential "Achilles heels" for SARS-CoV-2, perhaps susceptible to therapies like those envisaged for AIDS. The ribosomal frameshifting element scored well, but higher SI scores were obtained in other regions, including those encoding NSP13 and the nucleocapsid (N) protein.
Affiliation(s)
- Chiyu Zhang
- Shanghai Public Health Clinical Center, Fudan University, Shanghai, China
- Donald R Forsdyke
- Department of Biomedical and Molecular Sciences, Queen's University, Kingston, Ontario K7L3N6, Canada
|