1
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
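The fragmentation problem described in this abstract can be illustrated with a toy annotated De Bruijn graph. The following is a minimal sketch under stated assumptions, not the MLA implementation: the function names `build_annotated_dbg` and `longest_walk_prefix` are hypothetical, nodes are plain k-mers held in a dictionary, and edges are implied by (k-1)-overlaps between consecutive k-mers.

```python
from collections import defaultdict


def build_annotated_dbg(samples, k):
    """Map each k-mer (node) to the set of sample labels that contain it.

    `samples` is a dict: label -> list of sequences (hypothetical input format).
    """
    labels = defaultdict(set)
    for label, seqs in samples.items():
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                labels[seq[i:i + k]].add(label)
    return dict(labels)


def longest_walk_prefix(query, graph, k, label=None):
    """Length of the longest query prefix spelled by a walk in the graph.

    If `label` is given, every node on the walk must carry that label
    (single-label search); otherwise any node may be used (label-free search).
    """
    matched = 0
    for i in range(len(query) - k + 1):
        node = query[i:i + k]
        if node not in graph:
            break
        if label is not None and label not in graph[node]:
            break
        matched = i + k
    return matched


if __name__ == "__main__":
    graph = build_annotated_dbg({"A": ["ACGTAC"], "B": ["GTACGG"]}, k=3)
    query = "ACGTACGG"
    print(longest_walk_prefix(query, graph, 3))        # label-free walk
    print(longest_walk_prefix(query, graph, 3, "A"))   # restricted to sample A
    print(longest_walk_prefix(query, graph, 3, "B"))   # restricted to sample B
```

In this toy input neither sample alone spells the whole query, while the union of both does, which is the situation that motivates combining labels during alignment.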
Affiliation(s)
- Harun Mustafa
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Mikhail Karasikov
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Nika Mansouri Ghiasi
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
- Gunnar Rätsch
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
  - ETH AI Center, Zurich, 8092, Switzerland
  - Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
  - The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
- André Kahles
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
  - The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
2
Chandra G, Jain C. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. J Comput Biol 2023; 30:1182-1197. [PMID: 37902967 DOI: 10.1089/cmb.2023.0186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
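To make the chaining step of seed-chain-extend concrete, here is a minimal sketch of colinear chaining with a linear gap cost in the simpler sequence-to-sequence setting. It is a quadratic-time dynamic program, not the path-cover-based graph algorithm of the paper; the anchor format `(query_start, ref_start, length)`, the function name `chain_anchors`, and the `gap_penalty` parameter are assumptions made for illustration.

```python
def chain_anchors(anchors, gap_penalty=1):
    """Return (best_score, chain) for a list of exact-match anchors.

    Each anchor is a tuple (query_start, ref_start, length).
    An anchor j may follow anchor i only if i ends before j starts
    in both the query and the reference (colinearity, no overlap).
    The gap cost is gap_penalty * |query_gap - ref_gap| (linear).
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [a[2] for a in anchors]  # a chain may start at any anchor
    prev = [-1] * n
    for j in range(n):
        qj, rj, lj = anchors[j]
        for i in range(j):
            qi, ri, li = anchors[i]
            dq = qj - (qi + li)  # unmatched query characters between i and j
            dr = rj - (ri + li)  # unmatched reference characters between i and j
            if dq < 0 or dr < 0:
                continue  # overlapping or out of order: not colinear
            cand = score[i] + lj - gap_penalty * abs(dq - dr)
            if cand > score[j]:
                score[j] = cand
                prev[j] = i
    best = max(range(n), key=score.__getitem__)
    chain = []
    j = best
    while j != -1:  # backtrack through predecessor links
        chain.append(anchors[j])
        j = prev[j]
    return score[best], chain[::-1]


if __name__ == "__main__":
    seeds = [(0, 0, 5), (6, 6, 4), (7, 30, 10), (12, 13, 5)]
    print(chain_anchors(seeds))
```

The anchor at reference position 30 is long but far off the diagonal of the others, so the gap cost keeps it out of the best chain; without a gap cost, a chainer has no reason to prefer nearby anchors over distant ones.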
Affiliation(s)
- Ghanshyam Chandra
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
- Chirag Jain
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
3
Depuydt L, Renders L, Abeel T, Fostier J. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinformatics 2023; 24:400. [PMID: 37884897 PMCID: PMC10605969 DOI: 10.1186/s12859-023-05531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023]
Abstract
BACKGROUND Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
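The backward search that this abstract refers to can be sketched in a few lines. The example below is a plain unidirectional FM-index built naively from a sorted suffix list, written only to show the mechanics; it is not the Nexus data structure, and the names `build_fm_index` and `count_occurrences` are hypothetical. A bidirectional index additionally tracks the matching range over the reversed text so that a pattern can be extended in either direction, which is not shown here.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table) for `text`.

    Assumes `text` does not contain the sentinel '$'. Construction is
    naive (sorting all suffixes) and only suitable for small inputs.
    """
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact occurrences of `pattern` by backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffixes matching the processed part
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("ACGTACGT")
    for pattern in ("ACG", "GTA", "TT"):
        print(pattern, count_occurrences(pattern, index))
```

Each pattern character narrows the suffix range with two table lookups, so the query time depends on the pattern length rather than on the text length.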
Affiliation(s)
- Lore Depuydt
  - Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
- Luca Renders
  - Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
- Thomas Abeel
  - Delft Bioinformatics Lab, Delft University of Technology, 2628 XE, Delft, The Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Jan Fostier
  - Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
4
Ma J, Cáceres M, Salmela L, Mäkinen V, Tomescu AI. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 2023; 39:btad460. [PMID: 37494467 PMCID: PMC10423031 DOI: 10.1093/bioinformatics/btad460] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/20/2022] [Revised: 06/08/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023]
Abstract
MOTIVATION Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. 
AVAILABILITY AND IMPLEMENTATION GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
Affiliation(s)
- Jun Ma
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Manuel Cáceres
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Veli Mäkinen
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
5
Singh V, Pandey S, Bhardwaj A. From the reference human genome to human pangenome: Premise, promise and challenge. Front Genet 2022; 13:1042550. [PMID: 36437921 PMCID: PMC9684177 DOI: 10.3389/fgene.2022.1042550] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/12/2022] [Accepted: 10/21/2022] [Indexed: 11/11/2022]
Abstract
The Reference Human Genome remains the single most important resource for mapping genetic variations and assessing their impact. However, it is monophasic, incomplete and not representative of the variation that exists in the population. Given the extent of ethno-geographic diversity and the consequent diversity in clinical manifestations of these variations, population-specific references were developed over time. The dramatically plummeting cost of sequencing whole genomes and the advent of third-generation long-range sequencers allowing accurate, error-free, telomere-to-telomere assemblies of human genomes present us with a unique and unprecedented opportunity to develop a more composite standard reference consisting of a collection of multiple genomes that capture the maximal variation existing in the population, with the deepest annotation possible, enabling a realistic, reliable and actionable estimation of clinical significance of specific variations. The Human Pangenome Project thus is a logical next step promising a more accurate and global representation of genomic variations. The pangenome effort must be reciprocally complemented with precise variant discovery tools and exhaustive annotation to ensure unambiguous clinical assessment of the variant in ethno-geographical context. Here we discuss a broad roadmap, the challenges and way forward in developing a universal pangenome reference including data visualization techniques and integration of prior knowledge base in the new graph-based architecture and tools to submit, compare, query, annotate and retrieve relevant information from the pangenomes. The biggest challenge, however, will be the ethical, legal and social implications and the training of human resources in the new reference paradigm.
Affiliation(s)
- Vipin Singh
  - University Institute of Biotechnology, Chandigarh University, Mohali, India
- Shweta Pandey
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Anshu Bhardwaj
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
  - *Correspondence: Anshu Bhardwaj,
6
Tolstoganov I, Kamenev Y, Kruglikov R, Ochkalova S, Korobeynikov A. BinSPreader: refine binning results for fuller MAG reconstruction. iScience 2022; 25:104770. [PMID: 35992057 PMCID: PMC9386100 DOI: 10.1016/j.isci.2022.104770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Revised: 06/20/2022] [Accepted: 07/12/2022] [Indexed: 11/02/2022]
Abstract
Despite the recent advances in high-throughput sequencing, metagenome analysis of microbial populations still remains a challenge. In particular, the metagenome-assembled genomes (MAGs) are often fragmented due to interspecies repeats, uneven coverage, and varying strain abundance. MAGs are constructed via a binning process that uses features of input data in order to cluster long contigs presumably belonging to the same species. In this work, we present BinSPreader, a binning refiner tool that exploits the assembly graph topology and other connectivity information to refine binning, correct binning errors, and propagate binning to shorter contigs. We show that BinSPreader could increase the completeness of the bins without sacrificing the purity and could predict contigs belonging to several MAGs. BinSPreader is effective in binning shorter contigs that often contain important conservative sequences that might be of great interest to researchers.
- We propose a novel method to refine the binning using the assembly graph connectivity
- Method could also use paired-end reads, Hi-C links, and other sources of connectivity
- It improves the completeness of the bins without sacrificing the purity
- BinSPreader could predict contigs belonging to several MAGs
7
Quan C, Lu H, Lu Y, Zhou G. Population-scale genotyping of structural variation in the era of long-read sequencing. Comput Struct Biotechnol J 2022; 20:2639-2647. [PMID: 35685364 PMCID: PMC9163579 DOI: 10.1016/j.csbj.2022.05.047] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/30/2022] [Revised: 05/24/2022] [Accepted: 05/24/2022] [Indexed: 11/29/2022]
Abstract
Population-scale studies of structural variation (SV) are growing rapidly worldwide with the development of long-read sequencing technology, yielding a considerable number of novel SVs and complete gap-closed genome assemblies. Herein, we highlight recent studies using a hybrid sequencing strategy and present the challenges toward large-scale genotyping for SVs due to the reference bias. Genotyping SVs at a population scale remains challenging, which severely impacts genotype-based population genetic studies or genome-wide association studies of complex diseases. We summarize academic efforts to improve genotype quality through linear or graph representations of reference and alternative alleles. Graph-based genotypers capable of integrating diverse genetic information are effectively applied to large and diverse cohorts, contributing to unbiased downstream analysis. Meanwhile, there is still an urgent need in this field for efficient tools to construct complex graphs and perform sequence-to-graph alignments.
Affiliation(s)
- Cheng Quan
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
- Hao Lu
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
- Yiming Lu
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
  - Hebei University, Baoding, Hebei Province 071002, PR China
  - Corresponding authors at: Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, PR China (G. Zhou). Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing, 100850, PR China (Y. Lu).
- Gangqiao Zhou
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
  - Collaborative Innovation Center for Personalized Cancer Medicine, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu Province 211166, PR China
  - Medical College of Guizhou University, Guiyang, Guizhou Province 550025, PR China
  - Hebei University, Baoding, Hebei Province 071002, PR China
8
The Human Pangenome Project: a global resource to map genomic diversity. Nature 2022; 604:437-446. [PMID: 35444317 DOI: 10.1038/s41586-022-04601-8] [Citation(s) in RCA: 148] [Impact Index Per Article: 74.0] [Received: 08/30/2021] [Accepted: 03/01/2022] [Indexed: 12/20/2022]
Abstract
The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.
9
Souvorov A, Agarwala R. SAUTE: sequence assembly using target enrichment. BMC Bioinformatics 2021; 22:375. [PMID: 34289805 PMCID: PMC8293564 DOI: 10.1186/s12859-021-04174-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/25/2021] [Accepted: 05/05/2021] [Indexed: 01/25/2023]
Abstract
BACKGROUND Illumina is the dominant sequencing technology at this time. Short length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well-supported variants for assembled regions. RESULTS To facilitate assembly of repeat regions and to report multiple well-supported variants when a user can provide target sequences to assist the assembly, we propose the SAUTE and SAUTE_PROT assemblers. Both assemblers use a de Bruijn graph built on reads. Targets can be transcripts or proteins for RNA-seq reads and transcripts, proteins, or genomic regions for genomic reads. Target sequences are nucleotide and protein sequences for SAUTE and SAUTE_PROT, respectively. CONCLUSIONS For RNA-seq, comparisons with Trinity, rnaSPAdes, SPAligner, and SPAdes assembly of reads aligned to target proteins by DIAMOND show that SAUTE_PROT finds more coding sequences that translate to benchmark proteins. Using AMRFinderPlus calls, we find SAUTE has higher sensitivity and precision than SPAdes, plasmidSPAdes, SPAligner, and SPAdes assembly of reads aligned to target regions by HISAT2. It also has better sensitivity than SKESA but worse precision.
Affiliation(s)
- Richa Agarwala
  - NCBI/NLM/NIH/DHHS, 8600 Rockville Pike, Bethesda, MD, 20894, USA
10
Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R, Turner K, Pevzner PA. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. Microbiome 2021; 9:149. [PMID: 34183047 PMCID: PMC8240309 DOI: 10.1186/s40168-021-01092-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/26/2021] [Accepted: 05/11/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacteria, Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered through multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. METHODS Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. RESULTS We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of greater diversity of IPGs from (meta)genomes. CONCLUSIONS We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify even more novel IPGs for phenotypic testing than would previously be inaccessible by traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.
Affiliation(s)
- Tatiana Dvorkina
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Anton Bankevich
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
- Alexei Sorokin
  - Université Paris-Saclay, INRAE, Micalis Institute, AgroParisTech, 78350 Jouy-en-Josas, France
- Fan Yang
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
  - Ascus Biosciences, San Diego, CA, USA
- Boahemaa Adu-Oppong
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
  - Thermo Fisher Scientific, Carlsbad, CA, USA
- Ryan Williams
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
- Keith Turner
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
- Pavel A. Pevzner
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
11
Lapidus AL, Korobeynikov AI. Metagenomic Data Assembly - The Way of Decoding Unknown Microorganisms. Front Microbiol 2021; 12:613791. [PMID: 33833738 PMCID: PMC8021871 DOI: 10.3389/fmicb.2021.613791] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/03/2020] [Accepted: 03/03/2021] [Indexed: 01/08/2023]
Abstract
Metagenomics is a segment of conventional microbial genomics dedicated to the sequencing and analysis of combined genomic DNA of entire environmental samples. The most critical step of metagenomic data analysis is the reconstruction of individual genes and genomes of the microorganisms in the communities using metagenomic assemblers, computational programs that put together small fragments of sequenced DNA generated by sequencing instruments. Here, we describe the challenges of metagenomic assembly and a wide spectrum of applications in which metagenomic assemblies were used to better understand the ecology and evolution of microbial ecosystems, and we present one of the most efficient microbial assemblers, SPAdes, which was upgraded to become applicable to metagenomics.
Affiliation(s)
- Alla L. Lapidus
  - Center for Algorithmic Biotechnology, St. Petersburg State University, Saint Petersburg, Russia