1
Lee J, Kim M, Han K, Yoon S. StringFix: an annotation-guided transcriptome assembler improves the recovery of amino acid sequences from RNA-Seq reads. Genes Genomics 2023; 45:1599-1609. [PMID: 37837515 DOI: 10.1007/s13258-023-01458-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 10/01/2023] [Indexed: 10/16/2023]
Abstract
BACKGROUND Reconstruction of amino acid sequences from an assembled transcriptome is of interest in personalized medicine, for example, to predict drug-target (or protein-protein) interactions while accounting for an individual's genomic variations. Most existing transcriptome assemblers, however, seem poorly suited for this purpose. METHODS In this work, we present StringFix, an annotation-guided transcriptome assembly and protein sequence reconstruction tool that takes genome-aligned reads and the annotations associated with the reference genome as input. The tool 'fixes' the pre-annotated transcript sequence by taking small variations into account, and finally produces possible amino acid sequences that are likely to exist in the test tissue. RESULTS The results show that, using outputs from existing reference-based assemblers as the input GTF guide, StringFix could reconstruct amino acid sequences more precisely and with higher sensitivity than direct generation from the recovered transcripts, for all the assemblers we tested. CONCLUSION By using StringFix with existing reference-based assemblers, one can recover not only novel transcripts and isoforms but also the possible amino acid sequences stemming from them.
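As a rough illustration of what 'fixing' an annotated transcript with small variations and then deriving an amino acid sequence could look like, the following minimal Python sketch applies single-nucleotide substitutions to a toy coding sequence and translates it with the standard genetic code. It is not StringFix's algorithm or interface: the function names `apply_snvs` and `translate`, the variant format, and the toy sequence are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(transcript, snvs):
    """Return a copy of `transcript` with substitutions applied.

    `snvs` maps a 0-based position to (reference base, alternate base).
    This variant format is hypothetical, chosen only for the example.
    """
    seq = list(transcript)
    for pos, (ref, alt) in snvs.items():
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos] = alt
    return "".join(seq)


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy annotated coding sequence: ATG GCC TGA (Met, Ala, stop).
annotated = "ATGGCCTGA"
# One hypothetical sample-specific substitution: C -> T at position 4 (GCC -> GTC).
sample_specific = apply_snvs(annotated, {4: ("C", "T")})

print(translate(annotated))
print(translate(sample_specific))
```

The point of the sketch is only that a single observed substitution can change the protein derived from an otherwise unchanged annotation, which is why sample-specific variation matters for amino acid reconstruction.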
Affiliation(s)
- Joongho Lee
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Minsoo Kim
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Kyudong Han
  - Center for Bio-Medical Engineering Core Facility, Dankook Univ, Cheonan, 31116, Korea
  - Dept. of Microbiology, College of Science & Technology, Dankook Univ, Cheonan, 31116, Korea
  - HuNbiome Co., Ltd, R&D Center, Seoul, 08503, Korea
- Seokhyun Yoon
  - Dept. of Electronics and Electrical Engineering, College of Engineering, Dankook Univ, Yongin-si, 16890, Korea.
2
Mallawaarachchi V, Roach MJ, Decewicz P, Papudeshi B, Giles SK, Grigson SR, Bouras G, Hesse RD, Inglis LK, Hutton ALK, Dinsdale EA, Edwards RA. Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics 2023; 39:btad586. [PMID: 37738590 PMCID: PMC10563150 DOI: 10.1093/bioinformatics/btad586] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/04/2023] [Revised: 07/14/2023] [Accepted: 09/19/2023] [Indexed: 09/24/2023] Open
Abstract
MOTIVATION Microbial communities have a profound impact on both human health and various environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. Because viral assembly is challenging, genomes can become fragmented, and existing tools may recover only incomplete genome fragments. The identification and characterization of novel phage genomes therefore remain a challenge, creating a need for improved approaches to phage genome recovery. RESULTS We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results on viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. AVAILABILITY AND IMPLEMENTATION Phables is available on GitHub at https://github.com/Vini2/phables.
Affiliation(s)
- Vijini Mallawaarachchi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Michael J Roach
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Przemyslaw Decewicz
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Warsaw 02-096, Poland
- Bhavya Papudeshi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Sarah K Giles
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Susanna R Grigson
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- George Bouras
  - Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia 5005, Australia
  - The Department of Surgery—Otolaryngology Head and Neck Surgery, Central Adelaide Local Health Network, Adelaide, South Australia 5000, Australia
- Ryan D Hesse
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Laura K Inglis
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Abbey L K Hutton
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia 5042, Australia
3
Mallawaarachchi V, Roach MJ, Decewicz P, Papudeshi B, Giles SK, Grigson SR, Bouras G, Hesse RD, Inglis LK, Hutton ALK, Dinsdale EA, Edwards RA. Phables: from fragmented assemblies to high-quality bacteriophage genomes. bioRxiv 2023:2023.04.04.535632. [PMID: 37066369 PMCID: PMC10104058 DOI: 10.1101/2023.04.04.535632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
Microbial communities influence both human health and different environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies, and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of the challenges in viral assembly, fragmentation of genomes can occur, leading to the need for new approaches in viral identification. Therefore, the identification and characterisation of novel phages remain a challenge. We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results of viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. Phables is available on GitHub at https://github.com/Vini2/phables.
Affiliation(s)
- Vijini Mallawaarachchi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Michael J Roach
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Przemyslaw Decewicz
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Warsaw 02-096, Poland
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Bhavya Papudeshi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Sarah K Giles
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Susanna R Grigson
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- George Bouras
  - Adelaide Medical School, The University of Adelaide, North Tce, Adelaide, SA, 5000, Australia
- Ryan D Hesse
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Laura K Inglis
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Abbey L K Hutton
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, Adelaide, SA, 5042, Australia
4
Khan S, Kortelainen M, Cáceres M, Williams L, Tomescu AI. Improving RNA Assembly via Safety and Completeness in Flow Decompositions. J Comput Biol 2022; 29:1270-1287. [PMID: 36288562 PMCID: PMC9807076 DOI: 10.1089/cmb.2022.0261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023] Open
Abstract
Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking and transportation planning to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criterion guarantees the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, although the greedy-width algorithm reports better coverage, it also reports significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms greedy-width (by ≈20%) on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has superior time (4-5×) and space (1.2-2.2×) performance, resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
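For readers unfamiliar with the greedy-width heuristic named in the abstract, the following minimal Python sketch shows the usual idea under stated assumptions: repeatedly extract the source-to-sink path with the largest bottleneck flow and subtract that amount along the path. The toy graph, the function names, and the requirement that the caller supply a topological order are assumptions for this illustration; this is not the authors' implementation, and it does not compute safe paths.

```python
from collections import defaultdict


def widest_path(flow, source, sink, order):
    """Return (bottleneck, path) for the source-sink path whose minimum
    remaining edge flow is largest; `order` is a topological order of the DAG."""
    best = {source: float("inf")}
    prev = {}
    for u in order:
        if u not in best:
            continue
        for v, f in flow[u].items():
            if f <= 0:
                continue  # edge already fully explained by earlier paths
            width = min(best[u], f)
            if width > best.get(v, 0):
                best[v] = width
                prev[v] = u
    if sink not in best:
        return 0, []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]


def greedy_width(edges, source, sink, order):
    """Decompose a flow on a DAG into weighted paths, widest path first."""
    flow = defaultdict(dict)
    for (u, v), f in edges.items():
        flow[u][v] = f
    paths = []
    while True:
        width, path = widest_path(flow, source, sink, order)
        if width == 0:
            break
        for u, v in zip(path, path[1:]):
            flow[u][v] -= width
        paths.append((width, path))
    return paths


# Toy flow network (flow is conserved at the inner nodes a and b).
edges = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 4, ("a", "b"): 1, ("b", "t"): 4}
decomposition = greedy_width(edges, "s", "t", order=["s", "a", "b", "t"])
for weight, path in decomposition:
    print(weight, "-".join(path))
```

On this toy input the heuristic returns three paths whose weights sum to the total flow of 8. On larger graphs the same flow can admit several different decompositions, which is the ambiguity that motivates reporting only safe paths.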
Affiliation(s)
- Shahbaz Khan
  - Department of Computer Science and Engineering, IIT Roorkee, Roorkee, India
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Address correspondence to: Prof. Shahbaz Khan, Department of Computer Science and Engineering, IIT Roorkee, Haridwar Highway, Roorkee 247667, Uttarakhand, India
- Milla Kortelainen
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Manuel Cáceres
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
  - School of Computing, Montana State University, Bozeman, Montana, USA
5
Caceres M, Mumey B, Husic E, Rizzi R, Cairo M, Sahlin K, Tomescu AI. Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3673-3684. [PMID: 34847041 DOI: 10.1109/tcbb.2021.3131203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms for finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their greater length relative to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.
6
Dias FH, Williams L, Mumey B, Tomescu AI. Efficient Minimum Flow Decomposition via Integer Linear Programming. J Comput Biol 2022; 29:1252-1267. [PMID: 36260412 PMCID: PMC9700332 DOI: 10.1089/cmb.2022.0257] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022] Open
Abstract
Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.
Affiliation(s)
- Fernando H.C. Dias
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
  - School of Computing, Montana State University, Bozeman, Montana, USA
- Brendan Mumey
  - School of Computing, Montana State University, Bozeman, Montana, USA
7
Nunn A, Rodríguez‐Arévalo I, Tandukar Z, Frels K, Contreras‐Garrido A, Carbonell‐Bejerano P, Zhang P, Ramos Cruz D, Jandrasits K, Lanz C, Brusa A, Mirouze M, Dorn K, Galbraith DW, Jarvis BA, Sedbrook JC, Wyse DL, Otto C, Langenberger D, Stadler PF, Weigel D, Marks MD, Anderson JA, Becker C, Chopra R. Chromosome-level Thlaspi arvense genome provides new tools for translational research and for a newly domesticated cash cover crop of the cooler climates. Plant Biotechnol J 2022; 20:944-963. [PMID: 34990041 PMCID: PMC9055812 DOI: 10.1111/pbi.13775] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/17/2021] [Revised: 11/28/2021] [Accepted: 12/23/2021] [Indexed: 05/20/2023]
Abstract
Thlaspi arvense (field pennycress) is being domesticated as a winter annual oilseed crop capable of improving ecosystems and intensifying agricultural productivity without increasing land use. It is a selfing diploid with a short life cycle and is amenable to genetic manipulations, making it an accessible field-based model species for genetics and epigenetics. The availability of a high-quality reference genome is vital for understanding pennycress physiology and for clarifying its evolutionary history within the Brassicaceae. Here, we present a chromosome-level genome assembly of var. MN106-Ref with improved gene annotation and use it to investigate gene structure differences between two accessions (MN108 and Spring32-10) that are highly amenable to genetic transformation. We describe non-coding RNAs, pseudogenes and transposable elements, and highlight tissue-specific expression and methylation patterns. Resequencing of forty wild accessions provided insights into genome-wide genetic variation, and QTL regions were identified for a seedling colour phenotype. Altogether, these data will serve as a tool for pennycress improvement in general and for translational research across the Brassicaceae.
Affiliation(s)
- Adam Nunn
  - ecSeq Bioinformatics GmbH, Leipzig, Germany
  - Department of Computer Science, Leipzig University, Leipzig, Germany
- Isaac Rodríguez‐Arévalo
  - Genetics, Faculty of Biology, Ludwig Maximilians University, Martinsried, Germany
  - Gregor Mendel Institute of Molecular Plant Biology GmbH, Austrian Academy of Sciences (ÖAW), Vienna BioCenter (VBC), Vienna, Austria
- Zenith Tandukar
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
- Katherine Frels
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
  - Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE, USA
- Panpan Zhang
  - Institut de Recherche pour le Développement, UMR232 DIADE, Montpellier, France
  - Laboratory of Plant Genome and Development, University of Perpignan, Perpignan, France
- Daniela Ramos Cruz
  - Genetics, Faculty of Biology, Ludwig Maximilians University, Martinsried, Germany
  - Gregor Mendel Institute of Molecular Plant Biology GmbH, Austrian Academy of Sciences (ÖAW), Vienna BioCenter (VBC), Vienna, Austria
- Katharina Jandrasits
  - Genetics, Faculty of Biology, Ludwig Maximilians University, Martinsried, Germany
  - Gregor Mendel Institute of Molecular Plant Biology GmbH, Austrian Academy of Sciences (ÖAW), Vienna BioCenter (VBC), Vienna, Austria
- Christa Lanz
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Anthony Brusa
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
- Marie Mirouze
  - Institut de Recherche pour le Développement, UMR232 DIADE, Montpellier, France
  - Laboratory of Plant Genome and Development, University of Perpignan, Perpignan, France
- Kevin Dorn
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, MN, USA
  - USDA‐ARS, Soil Management and Sugarbeet Research, Fort Collins, CO, USA
- David W Galbraith
  - BIO5 Institute, Arizona Cancer Center, Department of Biomedical Engineering, University of Arizona, School of Plant Sciences, Tucson, AZ, USA
- Brice A. Jarvis
  - School of Biological Sciences, Illinois State University, Normal, IL, USA
- John C. Sedbrook
  - School of Biological Sciences, Illinois State University, Normal, IL, USA
- Donald L. Wyse
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
- Peter F. Stadler
  - Department of Computer Science, Leipzig University, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Detlef Weigel
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- M. David Marks
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, MN, USA
- James A. Anderson
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
- Claude Becker
  - Genetics, Faculty of Biology, Ludwig Maximilians University, Martinsried, Germany
  - Gregor Mendel Institute of Molecular Plant Biology GmbH, Austrian Academy of Sciences (ÖAW), Vienna BioCenter (VBC), Vienna, Austria
- Ratan Chopra
  - Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, MN, USA
8
Zheng H, Ma C, Kingsford C. Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability. J Comput Biol 2022; 29:121-139. [PMID: 35041494 PMCID: PMC8892959 DOI: 10.1089/cmb.2021.0444] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes, where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification that accounts for reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a "confidence range of expression" for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range indicates both whether a transcript has potential estimation error due to nonidentifiability and the extent of the error. Applying our methods to the Human Body Map data, we observe that 35%-50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy for 20%-47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that the majority of detected differentially expressed transcripts are reliable, with a few exceptions, after considering the ranges of the optimal expression estimates.
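A small worked example may help make nonidentifiability concrete. The Python sketch below uses a hypothetical gene with two alternative upstream segments (A1, A2) and two alternative downstream segments (B1, B2), giving four isoforms, and it assumes that reads are short enough that none spans both an A and a B segment. Under that assumption the data constrain only per-segment coverage, and two different abundance estimates explain it equally well. The gene model, the names, and the numbers are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical gene model: each isoform uses one A segment and one B segment.
ISOFORMS = {
    "A1-B1": ("A1", "B1"),
    "A1-B2": ("A1", "B2"),
    "A2-B1": ("A2", "B1"),
    "A2-B2": ("A2", "B2"),
}


def segment_coverage(abundance):
    """Expected relative coverage of each segment given isoform abundances,
    assuming no read spans two segments (toy model)."""
    coverage = {}
    for name, segments in ISOFORMS.items():
        for seg in segments:
            coverage[seg] = coverage.get(seg, 0.0) + abundance[name]
    return coverage


# Two different abundance estimates for the same four isoforms.
estimate_1 = {"A1-B1": 0.5, "A1-B2": 0.0, "A2-B1": 0.0, "A2-B2": 0.5}
estimate_2 = {"A1-B1": 0.25, "A1-B2": 0.25, "A2-B1": 0.25, "A2-B2": 0.25}

print(segment_coverage(estimate_1))
print(segment_coverage(estimate_2))
```

Both estimates give every segment the same coverage of 0.5, so in this toy model the abundance of isoform A1-B2 could be anywhere from 0 to at least 0.25 without changing the fit, which is the kind of range the abstract describes.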
Affiliation(s)
- Hongyu Zheng
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Cong Ma
  - Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
9
RNA-seq for revealing the function of the transcriptome. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00002-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
10
Zhao J, Feng H, Zhu D, Lin Y. MultiTrans: An Algorithm for Path Extraction Through Mixed Integer Linear Programming for Transcriptome Assembly. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:48-56. [PMID: 34033544 DOI: 10.1109/tcbb.2021.3083277] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Recent advances in RNA-seq technology have made identification of expressed genes affordable, thus boosting the rapid development of transcriptomic studies. Transcriptome assembly, the reconstruction of all expressed transcripts from RNA-seq reads, is an essential step toward understanding genes, proteins, and cell functions. It remains a challenging problem due to complications from splicing variants, differing expression levels, uneven coverage and sequencing errors. Here, we formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming. MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. We benchmarked MultiTrans against two state-of-the-art transcriptome assemblers, TransLiG and rnaSPAdes. Experimental results show that MultiTrans generates more accurate transcripts compared to TransLiG (using the same splicing graphs) and rnaSPAdes (using the same assembly graphs). MultiTrans is freely available at https://github.com/jzbio/MultiTrans.
11
Souvorov A, Agarwala R. SAUTE: sequence assembly using target enrichment. BMC Bioinformatics 2021; 22:375. [PMID: 34289805 PMCID: PMC8293564 DOI: 10.1186/s12859-021-04174-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/25/2021] [Accepted: 05/05/2021] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND Illumina is the dominant sequencing technology at this time. Short length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well-supported variants for assembled regions. RESULTS To facilitate assembly of repeat regions and to report multiple well-supported variants when a user can provide target sequences to assist the assembly, we propose the SAUTE and SAUTE_PROT assemblers. Both assemblers use a de Bruijn graph on reads. Targets can be transcripts or proteins for RNA-seq reads and transcripts, proteins, or genomic regions for genomic reads. Target sequences are nucleotide and protein sequences for SAUTE and SAUTE_PROT, respectively. CONCLUSIONS For RNA-seq, comparisons with Trinity, rnaSPAdes, SPAligner, and SPAdes assembly of reads aligned to target proteins by DIAMOND show that SAUTE_PROT finds more coding sequences that translate to benchmark proteins. Using AMRFinderPlus calls, we find SAUTE has higher sensitivity and precision than SPAdes, plasmidSPAdes, SPAligner, and SPAdes assembly of reads aligned to target regions by HISAT2. It also has better sensitivity than SKESA but worse precision. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1186/s12859-021-04174-9.
Affiliation(s)
- Richa Agarwala
  - NCBI/NLM/NIH/DHHS, 8600 Rockville Pike, Bethesda, MD, 20894, USA.
12
Gatter T, Stadler PF. Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more. Bioinformatics 2021; 37:4307-4313. [PMID: 34255826 DOI: 10.1093/bioinformatics/btab494] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/09/2021] [Revised: 06/21/2021] [Accepted: 07/01/2021] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Accurate assembly of RNA-seq data is a crucial step in many analytic tasks such as gene annotation or expression studies. Despite ongoing research, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information than single-sample datasets and thus constitute a promising area of research. Yet, this advantage is challenging to utilize due to the large number of accumulating errors. RESULTS We present an extension to Ryūtō enabling the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at the level of low-level features. We report stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō's unique ability to utilize an (incomplete) reference for multi-sample assemblies greatly increases precision. We demonstrate benefits for differential expression analysis. CONCLUSION Ryūtō consistently improves assembly on replicates of the same tissue independent of filter settings, even when mixing conditions or time series. Consensus voting in Ryūtō is especially effective at high-precision assembly, while Ryūtō's conventional mode can reach higher recall. AVAILABILITY Ryūtō is available at https://github.com/studla/RYUTO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Gatter
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, D-04107 Leipzig, Germany
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, D-04107 Leipzig, Germany
- Discrete Biomath Group, Max Planck Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, A-1090 Wien, Austria
- Santa Fe Institute, Santa Fe, NM 87501, USA
|
13
|
Uhl M, Tran VD, Backofen R. Improving CLIP-seq data analysis by incorporating transcript information. BMC Genomics 2020; 21:894. [PMID: 33334306 PMCID: PMC7745353 DOI: 10.1186/s12864-020-07297-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/22/2020] [Accepted: 12/02/2020] [Indexed: 12/26/2022]
Abstract
BACKGROUND Current peak callers for identifying RNA-binding protein (RBP) binding sites from CLIP-seq data take into account genomic read profiles, but they ignore the underlying transcript information, that is, information regarding splicing events. So far, no studies have examined this issue closely. RESULTS Here we show that current peak callers are susceptible to false peak calling near exon borders. We quantify its extent in publicly available datasets, which turns out to be substantial. By providing a tool called CLIPcontext for automatic transcript and genomic context sequence extraction, we further demonstrate that context choice affects the performance of RBP binding site prediction tools. Moreover, we show that known motifs of exon-binding RBPs are often enriched in transcript context sites, which should enable the recovery of more authentic binding sites. Finally, we discuss possible strategies for integrating transcript information into future workflows. CONCLUSIONS Our results demonstrate the importance of incorporating transcript information in CLIP-seq data analysis. Taking advantage of the underlying transcript information should therefore become an integral part of future peak calling and downstream analysis tools.
Affiliation(s)
- Michael Uhl
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, 79110, Germany
- Van Dinh Tran
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, 79110, Germany
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, 79110, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Schaenzlestr. 18, Freiburg, 79104, Germany
|
14
|
Mao S, Pachter L, Tse D, Kannan S. RefShannon: A genome-guided transcriptome assembler using sparse flow decomposition. PLoS One 2020; 15:e0232946. [PMID: 32484809 PMCID: PMC7266320 DOI: 10.1371/journal.pone.0232946] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/18/2019] [Accepted: 04/24/2020] [Indexed: 12/12/2022]
Abstract
High-throughput sequencing of RNA (RNA-Seq) has become a staple in modern molecular biology, with applications not only in quantifying gene expression but also in isoform-level analysis of the RNA transcripts. To enable such an isoform-level analysis, a transcriptome assembly algorithm is utilized to stitch together the observed short reads into the corresponding transcripts. This task is complicated by alternative splicing, a mechanism by which the same gene may generate multiple distinct RNA transcripts. We develop a novel genome-guided transcriptome assembler, RefShannon, that exploits the varying abundances of the different transcripts to enable their accurate reconstruction. Our evaluation shows that RefShannon improves sensitivity by up to 22% at a given specificity compared with other state-of-the-art assemblers. RefShannon is written in Python and is available from GitHub (https://github.com/shunfumao/RefShannon).
Affiliation(s)
- Shunfu Mao
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, United States of America
- Lior Pachter
- Division of Biology and Biological Engineering, Caltech, Pasadena, CA, United States of America
- David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA, United States of America
- Sreeram Kannan
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, United States of America
|