1
Park A, Koslicki D. Prokrustean Graph: A substring index for rapid k-mer size analysis. bioRxiv 2024:2023.11.21.568151. [PMID: 38853857 PMCID: PMC11160577 DOI: 10.1101/2023.11.21.568151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built upon a well-defined k-mer-based object such as Jaccard Similarity, de Bruijn graphs, k-mer spectra, or Bray-Curtis Dissimilarity. Although these objects offer a clearer perspective on the role of k-mer sizes, their dynamics with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties in exploring varying k-mer sizes due to their limitations in grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
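To make concrete what "a k-mer-based object across k-mer sizes" means, here is a minimal brute-force sketch in Python. It is not the Prokrustean graph and is not taken from the authors' repository; it is the naive baseline in which the k-mer sets, and the Jaccard similarity built on them, are recomputed from scratch for every k, so the cost grows with the range of k. The function names (`kmers`, `jaccard`, `jaccard_by_k`) and the toy sequences are assumptions made for illustration.

```python
from __future__ import annotations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    # When k > len(seq) the range is empty, so the result is the empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_by_k(s: str, t: str, k_values: range) -> dict[int, float]:
    """Naively recompute the k-mer Jaccard similarity of s and t for every k."""
    return {k: jaccard(kmers(s, k), kmers(t, k)) for k in k_values}


if __name__ == "__main__":
    # Hypothetical toy sequences that differ at one position.
    s, t = "ACGTACGTGA", "ACGTTCGTGA"
    for k, similarity in jaccard_by_k(s, t, range(1, 6)).items():
        print(k, round(similarity, 3))
```

Each call to `kmers` rescans the whole input, which is the per-k cost that the abstract says the substring index avoids.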
Affiliation(s)
- Adam Park
  - Computer Science and Engineering, Pennsylvania State University, PA, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, PA, USA
  - Biology, Pennsylvania State University, PA, USA
  - Huck Institutes of the Life Sciences, Pennsylvania State University, PA, USA
2
Rádai Z, Váradi A, Takács P, Nagy NA, Schmitt N, Prépost E, Kardos G, Laczkó L. An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies. BMC Genomics 2024; 25:45. [PMID: 38195441 PMCID: PMC10777565 DOI: 10.1186/s12864-023-09910-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 12/15/2023] [Indexed: 01/11/2024]
Abstract
BACKGROUND: Parameters adversely affecting the contiguity and accuracy of assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking potential interactions in which the parameters exacerbate one another's effects in a multiplicative manner. To investigate whether they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, and PCR and optical duplicate ratios.
RESULTS: We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality.
CONCLUSIONS: We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced should also be taken into account, as they might influence the effects of the error sources themselves.
Affiliation(s)
- Zoltán Rádai
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Alex Váradi
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Laboratory Medicine, Medical School, University of Pécs, Pécs, Hungary
- Péter Takács
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Health Informatics, Institute of Health Sciences, Faculty of Health, University of Debrecen, Debrecen, Hungary
- Nikoletta Andrea Nagy
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology, ELKH-DE Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary
- Nicholas Schmitt
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Eszter Prépost
  - Department of Health Industry, University of Debrecen, Debrecen, Hungary
- Gábor Kardos
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Gerontology, Faculty of Health Sciences, University of Debrecen, Debrecen, Hungary
- Levente Laczkó
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - ELKH-DE Conservation Biology Research Group, Debrecen, Hungary
3
Varsamis GD, Karafyllidis IG, Gilkes KM, Arranz U, Martin-Cuevas R, Calleja G, Wong J, Jessen HC, Dimitrakis P, Kolovos P, Sandaltzopoulos R. Quantum algorithm for de novo DNA sequence assembly based on quantum walks on graphs. Biosystems 2023; 233:105037. [PMID: 37734700 DOI: 10.1016/j.biosystems.2023.105037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 09/16/2023] [Accepted: 09/18/2023] [Indexed: 09/23/2023]
Abstract
De novo DNA sequence assembly is based on finding paths in overlap graphs, which is an NP-hard problem. We developed a quantum algorithm for de novo assembly based on quantum walks in graphs. The overlap graph is partitioned repeatedly into smaller graphs that form a hierarchical structure. We use quantum walks to find paths in low-rank graphs and a quantum algorithm that finds Hamiltonian paths at high hierarchical rank. We tested the partitioning quantum algorithm, as well as the quantum algorithm that finds Hamiltonian paths at high hierarchical rank, and confirmed their correct operation using Qiskit. We developed a custom simulation for quantum walks to search for paths in low-rank graphs. The approach described in this paper may serve as a basis for the development of efficient quantum algorithms that solve the de novo DNA assembly problem.
Affiliation(s)
- G D Varsamis
  - Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, 67100, Greece
- I G Karafyllidis
  - Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, 67100, Greece
  - National Centre for Scientific Research Demokritos, Athens, 15342, Greece
- K M Gilkes
  - EY Global Innovation Quantum Computing Lab, USA
- U Arranz
  - EY Global Innovation Quantum Computing Lab, Spain
- G Calleja
  - EY Global Innovation Quantum Computing Lab, Spain
- J Wong
  - EY Global Innovation Quantum Computing Lab, USA
- H C Jessen
  - EY Global Innovation Quantum Computing Lab, Denmark
- P Dimitrakis
  - National Centre for Scientific Research Demokritos, Athens, 15342, Greece
- P Kolovos
  - Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupolis, 68100, Greece
- R Sandaltzopoulos
  - Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupolis, 68100, Greece
4
Kong W, Wang Y, Zhang S, Yu J, Zhang X. Recent Advances in Assembly of Complex Plant Genomes. Genomics, Proteomics & Bioinformatics 2023; 21:427-439. [PMID: 37100237 PMCID: PMC10787022 DOI: 10.1016/j.gpb.2023.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 03/18/2023] [Accepted: 04/07/2023] [Indexed: 04/28/2023]
Abstract
Over the past 20 years, tremendous advances in sequencing technologies and computational algorithms have spurred plant genomic research into a thriving era with hundreds of genomes decoded already, ranging from those of nonvascular plants to those of flowering plants. However, complex plant genome assembly is still challenging and remains difficult to fully resolve with conventional sequencing and assembly methods due to high heterozygosity, highly repetitive sequences, or high ploidy characteristics of complex genomes. Herein, we summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology, existing assembly methods, and different phasing algorithms. Moreover, we list actual cases of complex genome projects for readers to refer to and draw upon to solve future problems related to complex genomes. Finally, we expect that the accurate, gapless, telomere-to-telomere, and fully phased assembly of complex plant genomes could soon become routine.
Affiliation(s)
- Weilong Kong
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Yibin Wang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Shengcheng Zhang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jiaxin Yu
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Xingtan Zhang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
5
Lu Y, Ge C, Cai B, Xu Q, Kong R, Chang S. Antibody sequences assembly method based on weighted de Bruijn graph. Mathematical Biosciences and Engineering 2023; 20:6174-6190. [PMID: 37161102 DOI: 10.3934/mbe.2023266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
With the development of next-generation protein sequencing technologies, sequence assembly algorithms have become a key technology for the de novo sequencing process. Existing methods can assemble an unknown single protein chain. However, for monoclonal antibodies with light and heavy chains, assembly remains an unsolved problem. To address this problem, we propose a new assembly method, DBAS, which integrates the quality scores and sequence alignment scores of de novo sequenced peptides into a weighted de Bruijn graph to assemble the final protein sequences. The method is used to assemble sequences from two datasets with mixed light and heavy chains from antibodies. The results show that DBAS can assemble long antibody sequences for both mixed light and heavy chains and single chains. In addition, DBAS is able to distinguish the light and heavy chains by using BLAST sequence alignment. The results show that the algorithm performs well in terms of both target sequence coverage and contig assembly accuracy.
Affiliation(s)
- Yi Lu
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Cheng Ge
  - Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China
- Biao Cai
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Qing Xu
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Ren Kong
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Shan Chang
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
6
Buttler J, Drown DM. Accuracy and Completeness of Long Read Metagenomic Assemblies. Microorganisms 2022; 11:96. [PMID: 36677391 PMCID: PMC9861289 DOI: 10.3390/microorganisms11010096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/28/2022] [Revised: 12/22/2022] [Accepted: 12/28/2022] [Indexed: 01/03/2023]
Abstract
Microbes influence the surrounding environment and contribute to human health. Metagenomics can be used as a tool to explore the interactions between microbes. Metagenomic assemblies built using long-read nanopore data depend on read-level accuracy, which has improved dramatically for nanopore sequencing over the past several years. However, we do not know whether the increased read-level accuracy allows faster assemblers to produce metagenomic assemblies as accurate as those of slower assemblers. Here, we present the results of a benchmarking study comparing three commonly used long-read assemblers: Flye, Raven, and Redbean. We used a prepared DNA standard of seven bacteria as our input community. We prepared a sequencing library using a VolTRAX V2 and sequenced using a MinION mk1b. We basecalled with Guppy v5.0.7 using the super-accuracy model. We found that increasing read depth benefited each of the assemblers, and nearly complete community member chromosomes were assembled with as little as 10× read depth. Polishing assemblies using Medaka had a predictable improvement in quality. We found Flye to be the most robust across taxa and the most effective assembler for recovering plasmids. Based on Flye's consistency for chromosomes and increased effectiveness at assembling plasmids, we would recommend using Flye in future metagenomic studies.
Affiliation(s)
- Jeremy Buttler
  - Department of Biology and Wildlife, University of Alaska Fairbanks, Fairbanks, AK 99775, USA
- Devin M. Drown
  - Department of Biology and Wildlife, University of Alaska Fairbanks, Fairbanks, AK 99775, USA
  - Institute of Arctic Biology, University of Alaska Fairbanks, Fairbanks, AK 99775, USA
7
Lobo D, Linheiro R, Godinho R, Archer JP. On taming the effect of transcript level intra-condition count variation during differential expression analysis: A story of dogs, foxes and wolves. PLoS One 2022; 17:e0274591. [PMID: 36136981 PMCID: PMC9498955 DOI: 10.1371/journal.pone.0274591] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2021] [Accepted: 08/31/2022] [Indexed: 11/22/2022]
Abstract
The evolution of RNA-seq technologies has yielded datasets of scientific value that are often generated as condition-associated biological replicates within expression studies. With expanding data archives, the opportunity arises to augment replicate numbers when conditions of interest overlap. Despite correction procedures for estimating transcript abundance, a source of ambiguity is transcript-level intra-condition count variation, as indicated by disjointed results between analysis tools. We present TVscript, a tool that removes reference-based transcripts associated with intra-condition count variation above specified thresholds, and we explore the effects of such variation on differential expression analysis. Initially, iterative differential expression analysis was performed on simulated counts, where levels of intra-condition variation and sets of over-represented transcripts are explicitly specified. Then counts derived from inter- and intra-study data representing brain samples of dogs, wolves and foxes (wolves vs. dogs and aggressive vs. tame foxes) were used. For simulations, the sensitivity in detecting differentially expressed transcripts increased after removing hyper-variable transcripts, although at levels of intra-condition variation above 5% detection became unreliable. For real data, prior to applying TVscript, ≈20% of the transcripts identified as being differentially expressed were associated with high levels of intra-condition variation, an over-representation relative to the reference set. When transcripts harbouring such variation were removed pre-analysis, a discordance of 26 to 40% in the lists of differentially expressed transcripts was observed compared to those obtained using the non-filtered reference. The removal of transcripts possessing intra-condition variation values within (and above) the 97th and 95th percentiles, for wolves vs. dogs and aggressive vs. tame foxes, maximized the sensitivity in detecting differentially expressed transcripts as a result of alterations within gene-wise dispersion estimates. Through analysis of our real data, support is provided for seven genes with potential involvement in selection for tameness. TVscript is available at: https://sourceforge.net/projects/tvscript/.
Affiliation(s)
- Diana Lobo
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- Raquel Linheiro
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Universidade do Porto, Vairão, Portugal
- Raquel Godinho
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- John Patrick Archer
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão, Portugal
8
Zhang T, Zhou J, Gao W, Jia Y, Wei Y, Wang G. Complex genome assembly based on long-read sequencing. Brief Bioinform 2022; 23:6657663. [PMID: 35940845 DOI: 10.1093/bib/bbac305] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 05/06/2022] [Revised: 06/20/2022] [Accepted: 07/06/2022] [Indexed: 11/12/2022]
Abstract
High-quality chromosome-scale genome sequences provide an important basis for downstream genomic analysis, especially the construction of haplotype-resolved and complete genomes, which plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other areas. However, it is difficult for genome-wide short-read sequencing to produce a complete genome when the genome is complex, with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the integrity of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction: uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be routinely achieved in the future using only a simple process or a single tool.
Affiliation(s)
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Jie Zhou
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Wentao Gao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yanan Wei
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
9
Kang X, Luo X, Schönhuth A. StrainXpress: strain aware metagenome assembly from short reads. Nucleic Acids Res 2022; 50:e101. [PMID: 35776122 PMCID: PMC9508831 DOI: 10.1093/nar/gkac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 05/27/2022] [Accepted: 06/30/2022] [Indexed: 12/05/2022]
Abstract
Next-generation sequencing–based metagenomics has enabled the identification of microorganisms in characteristic habitats without the need for lengthy cultivation. Importantly, clinically relevant phenomena such as resistance to medication, virulence or interactions with the environment can vary even within species. Therefore, a major current challenge is to reconstruct individual genomes from the sequencing reads at the level of strains, and not just the level of species. However, strains of one species can differ by only a small number of variants, which makes it difficult to distinguish them. Despite considerable recent progress, related approaches have remained fragmentary so far. Here, we present StrainXpress as a comprehensive solution to the problem of strain-aware metagenome assembly from next-generation sequencing reads. In experiments, StrainXpress reconstructs strain-specific genomes from metagenomes that involve more than 1000 strains and successfully deals with poorly covered strains. The amount of reconstructed strain-specific sequence exceeds that of the current state-of-the-art approaches by on average 26.75% across all data sets (first quartile: 18.51%, median: 26.60%, third quartile: 35.05%).
Affiliation(s)
- Xiongbin Kang
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
- Xiao Luo
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
- Alexander Schönhuth
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
10
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
  - Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
  - Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
- Paola Bonizzoni
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
- Gianluca Della Vedova
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Yuri Pirola
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Raffaella Rizzi
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Jouni Sirén
  - Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
11
CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure. PLoS Comput Biol 2021; 17:e1009631. [PMID: 34813594 PMCID: PMC8651127 DOI: 10.1371/journal.pcbi.1009631] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/23/2021] [Revised: 12/07/2021] [Accepted: 11/11/2021] [Indexed: 11/19/2022]
Abstract
With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
Within transcriptome reference sets, non-chimeric sequences are representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. One area where such sets are utilized is in the quantification of gene expression patterns, where RNA-Seq reads are mapped to the sequences within, and subsequent count values reflect expression levels. Artificial chimeras can have a negative impact on count values by erroneously increasing variation in relation to the reads being mapped. Reference sets can be created from de novo assembled contigs, but chimeras can be introduced during the assembly process via the required traversal of graphs, representing gene families, constructed from the RNA-Seq data. Graph complexity determines how likely chimeras are to arise. We have created CStone, a de novo assembler that utilizes a classification system to describe such complexity. Contigs created by CStone are labelled in a manner that indicates whether or not they are non-chimeric. This encourages contig-dependent results to be presented with increased objectivity by maintaining the context of ambiguity associated with the assembly process. CStone has been tested extensively. Additionally, we have quantified the relationship between chimeras within reference sets and the identification of differentially expressed genes.
|
12
|
Haro-Moreno JM, López-Pérez M, Rodriguez-Valera F. Enhanced Recovery of Microbial Genes and Genomes From a Marine Water Column Using Long-Read Metagenomics. Front Microbiol 2021; 12:708782. [PMID: 34512586 PMCID: PMC8430335 DOI: 10.3389/fmicb.2021.708782] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/14/2021] [Accepted: 07/30/2021] [Indexed: 12/12/2022]
Abstract
Third-generation sequencing has made few inroads into metagenomics because of its high error rate and the dependence of assembly on bioinformatics tools designed for short reads. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and in the retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared PacBio Sequel II with the classical approach of Illumina NextSeq short reads followed by assembly. Long reads allow for efficient direct retrieval of complete genes, avoiding the bias of the assembly step. In addition, applying long reads to metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of the reconstructed MAGs was much more complete, containing many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or for population genomics, long-read metagenomics presently appears to be a very fruitful approach, and the data can be analyzed from raw reads before a computationally demanding (and potentially artifactual) assembly step.
Affiliation(s)
- Jose M. Haro-Moreno
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, Alicante, Spain
- Mario López-Pérez
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, Alicante, Spain
- Francisco Rodriguez-Valera
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, Alicante, Spain
  - Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
|
13
|
Gatter T, von Löhneysen S, Fallmann J, Drozdova P, Hartmann T, Stadler PF. LazyB: fast and cheap genome assembly. Algorithms Mol Biol 2021; 16:8. [PMID: 34074310 PMCID: PMC8168326 DOI: 10.1186/s13015-021-00186-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 05/06/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND Advances in genome sequencing over the last years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, "hybrid" methods that integrate short and long read data have been devised to address this need. RESULTS LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB stepwise extracts subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. CONCLUSIONS LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. AVAILABILITY The LazyB prototype is available at https://github.com/TGatter/LazyB .
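The abstract mentions extracting contigs as maximum weight paths from a directed acyclic graph. The following is a minimal, generic sketch of that one idea (dynamic programming over a topological order), not LazyB's actual code: LazyB additionally exploits properties of proper interval graphs, which are not modelled here. The function name `max_weight_path`, the edge-tuple input format, and the assumption of non-negative weights are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for a maximum-weight path in a DAG.

    `edges` is an iterable of (u, v, weight) tuples. Weights are assumed
    to be non-negative (an assumption of this sketch).
    """
    succ = defaultdict(list)   # node -> list of (successor, weight)
    indeg = defaultdict(int)   # node -> number of incoming edges
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes with no remaining predecessors.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle; a DAG is required")

    # best[n] = weight of the heaviest path ending at n; prev[n] = its predecessor.
    best = {n: 0 for n in nodes}
    prev = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Walk back from the node where the heaviest path ends.
    end = max(order, key=lambda n: best[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


# Toy overlap graph between four hypothetical reads; weights stand for overlap scores.
toy_edges = [("r1", "r2", 5), ("r2", "r3", 2), ("r1", "r3", 4), ("r3", "r4", 3)]
weight, path = max_weight_path(toy_edges)
```

In the toy graph, the path through `r2` is preferred over the direct `r1` to `r3` edge because the summed weights are larger. To extract several contigs one could remove the nodes of the returned path and repeat, though whether that matches LazyB's procedure is not stated in the abstract.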
Affiliation(s)
- Thomas Gatter
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
- Sarah von Löhneysen
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
- Jörg Fallmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
- Polina Drozdova
  - Institute of Biology, Irkutsk State University, RU-664003, Irkutsk, Russia
- Tom Hartmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
- Peter F Stadler
  - Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá, D.C, Colombia
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103, Leipzig, Germany
  - Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090, Vienna, Austria
  - Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
|
14
|
Hosseini ZZ, Rahimi SK, Forouzan E, Baraani A. RMI-DBG algorithm: A more agile iterative de Bruijn graph algorithm in short read genome assembly. J Bioinform Comput Biol 2021; 19:2150005. [PMID: 33866959 DOI: 10.1142/s0219720021500050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The de Bruijn graph (DBG) algorithm, one of the cornerstone algorithms in short read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and the low-cost production of millions of high-quality short reads. Erroneous reads, non-uniform coverage, and genomic repeats are three major problems that influence the performance of short read assemblers. To counter these problems, the iterative DBG algorithm applies multiple k-mer sizes instead of a single k-mer size, by iterating the DBG over a range of k-mer sizes from the minimum to the maximum. However, the iteration paradigm of iterative DBG deals with complex graphs from the beginning of the algorithm and therefore introduces more potential errors and requires more computation time to resolve various spurious branches. In this research, we propose the Reverse Modified Iterative DBG (named RMI-DBG) for short read assembly. RMI-DBG utilizes the DBG algorithm and the string graph to achieve the advantages of both algorithms. We show that RMI-DBG performs faster than iterative DBG with comparable results. Additionally, the quality of the proposed algorithm in terms of continuity and accuracy is evaluated against some commonly used assemblers on several real datasets of the GAGE-B benchmark.
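To make the role of the k-mer size concrete, the sketch below builds a plain de Bruijn graph for several values of k and counts branching nodes, which is why small k yields more complex graphs. It is a toy illustration only and does not implement RMI-DBG or any published iterative DBG assembler; the function names (`de_bruijn_graph`, `branching_nodes`, `branching_by_k`) and the example reads are assumptions made for this example, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer successors seen in `reads`."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Count nodes with more than one outgoing edge (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


def branching_by_k(reads, k_values):
    """Return {k: number of branching nodes} for each k in `k_values`."""
    return {k: branching_nodes(de_bruijn_graph(reads, k)) for k in k_values}


# Two hypothetical overlapping reads that contain the repeat "ACG" twice.
toy_reads = ["ACGTACGA", "CGTACGAT"]
result = branching_by_k(toy_reads, range(3, 6))
```

With these toy reads the repeat `ACG` causes a branch for k = 3 and k = 4, and the branch disappears at k = 5, once the node length exceeds the repeat length. Larger k resolves repeats but needs higher coverage, which is the trade-off that iterating over k-mer sizes tries to balance.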
Affiliation(s)
- Esmaeil Forouzan
  - National Institute for Genetic Engineering & Biotechnology (NIGEB), Tehran, Iran
  - GeneMan Genomics Ltd (www.ggenomics.ir), Shiraz, Iran
- Ahmad Baraani
  - Department of Software Engineering, University of Isfahan, Iran
|
15
|
Lapidus AL, Korobeynikov AI. Metagenomic Data Assembly - The Way of Decoding Unknown Microorganisms. Front Microbiol 2021; 12:613791. [PMID: 33833738 PMCID: PMC8021871 DOI: 10.3389/fmicb.2021.613791] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Received: 10/03/2020] [Accepted: 03/03/2021] [Indexed: 01/08/2023]
Abstract
Metagenomics is a segment of conventional microbial genomics dedicated to the sequencing and analysis of the combined genomic DNA of entire environmental samples. The most critical step of metagenomic data analysis is the reconstruction of individual genes and genomes of the microorganisms in the communities using metagenomic assemblers - computational programs that put together small fragments of sequenced DNA generated by sequencing instruments. Here, we describe the challenges of metagenomic assembly and a wide spectrum of applications in which metagenomic assemblies were used to better understand the ecology and evolution of microbial ecosystems, and we present one of the most efficient microbial assemblers, SPAdes, which was upgraded to become applicable to metagenomics.
Affiliation(s)
- Alla L. Lapidus
  - Center for Algorithmic Biotechnology, St. Petersburg State University, Saint Petersburg, Russia
|
16
|
Castro CJ, Marine RL, Ramos E, Ng TFF. The effect of variant interference on de novo assembly for viral deep sequencing. BMC Genomics 2020; 21:421. [PMID: 32571214 PMCID: PMC7306937 DOI: 10.1186/s12864-020-06801-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/02/2020] [Accepted: 06/02/2020] [Indexed: 12/21/2022]
Abstract
BACKGROUND Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approaches have surpassed Sanger sequencing for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. RESULTS Our results from >15,000 simulated experiments showed that the presence of variants can turn an assembly of one genome into tens to thousands of contigs. This "variant interference" (VI) is highly consistent and reproducible across ten commonly used de novo assemblers, and occurs over a range of genome lengths, read lengths, and GC contents. The main driver of VI is the pairwise identity between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allows the "rescue" of full viral genomes from fragmented contigs. CONCLUSIONS These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
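Since the abstract names pairwise identity between variants as the main driver of VI, a small sketch of that metric may help. It assumes the two variant sequences are already aligned and of equal length, and counts matching positions; the function name `pairwise_identity` and the example sequences are hypothetical, and the study may compute identity differently (for example from a gapped alignment).

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of positions at which two pre-aligned, equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Two hypothetical 10-nt variants differing at a single position.
major_variant = "ACGTACGTAC"
minor_variant = "ACGTTCGTAC"
identity = pairwise_identity(major_variant, minor_variant)
```

Here the two toy variants differ at one of ten positions, so the identity is 0.9.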
Affiliation(s)
- Christina J Castro
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN, USA
- Rachel L Marine
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA
- Edward Ramos
  - General Dynamics Information Technology, Inc., contracting agency to the Office of Informatics, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Falls Church, VA, USA
- Terry Fei Fan Ng
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA
|