1
|
Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs . J Comput Biol 2024. [PMID: 39381838 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/10/2024] Open
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
Collapse
Affiliation(s)
| | | | - Jason Fan
- Fulcrum Genomics LLC, Somerville, Massachusetts, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland, USA
| |
Collapse
|
2
|
Kim Y, Worby CJ, Acharya S, van Dijk LR, Alfonsetti D, Gromko Z, Azimzadeh P, Dodson K, Gerber G, Hultgren S, Earl AM, Berger B, Gibson TE. Strain tracking with uncertainty quantification. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.01.25.525531. [PMID: 36747646 PMCID: PMC9900846 DOI: 10.1101/2023.01.25.525531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
The ability to detect and quantify microbiota over time has a plethora of clinical, basic science, and public health applications. One of the primary means of tracking microbiota is through sequencing technologies. When the microorganism of interest is well characterized or known a priori , targeted sequencing is often used. In many applications, however, untargeted bulk (shotgun) sequencing is more appropriate; for instance, the tracking of infection transmission events and nucleotide variants across multiple genomic loci, or studying the role of multiple genes in a particular phenotype. Given these applications, and the observation that pathogens (e.g. Clostridioides difficile, Escherichia coli, Salmonella enterica ) and other taxa of interest can reside at low relative abundance in the gastrointestinal tract, there is a critical need for algorithms that accurately track low-abundance taxa with strain level resolution. Here we present a sequence quality- and time-aware model, ChronoStrain , that introduces uncertainty quantification to gauge low-abundance species and significantly outperforms the current state-of-the-art on both real and synthetic data. ChronoStrain leverages sequences' quality scores and the samples' temporal information to produce a probability distribution over abundance trajectories for each strain tracked in the model. We demonstrate Chronostrain's improved performance in capturing post-antibiotic Escherichia coli strain blooms among women with recurrent urinary tract infections (UTIs) from the UTI Microbiome (UMB) Project. Other strain tracking models on the same data either show inconsistent temporal colonization or can only track consistently using very coarse groupings. In contrast, our probabilistic outputs can reveal the relationship between low-confidence strains present in the sample that cannot be reliably assigned a single reference label (either due to poor coverage or novelty) while simultaneously calling high-confidence strains that can be unambiguously assigned a label. We also analyze samples from the Early Life Microbiota Colonisation (ELMC) Study demonstrating the algorithm's ability to correctly identify Enterococcus faecalis strains using paired sample isolates as validation.
Collapse
|
3
|
Pastusiak A, Reddy MR, Chen X, Hoyer I, Dorman J, Gebhardt ME, Carpi G, Norris DE, Pipas JM, Jackson EK. A metagenomic analysis of the phase 2 Anopheles gambiae 1000 genomes dataset reveals a wide diversity of cobionts associated with field collected mosquitoes. Commun Biol 2024; 7:667. [PMID: 38816486 PMCID: PMC11139907 DOI: 10.1038/s42003-024-06337-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 05/15/2024] [Indexed: 06/01/2024] Open
Abstract
The Anopheles gambiae 1000 Genomes (Ag1000G) Consortium previously utilized deep sequencing methods to catalogue genetic diversity across African An. gambiae populations. We analyzed the complete datasets of 1142 individually sequenced mosquitoes through Microsoft Premonition's Bayesian mixture model based (BMM) metagenomics pipeline. All specimens were confirmed as either An. gambiae sensu stricto (s.s.) or An. coluzzii with a high degree of confidence ( > 98% identity to reference). Homo sapiens DNA was identified in all specimens indicating contamination may have occurred either at the time of specimen collection, preparation and/or sequencing. We found evidence of vertebrate hosts in 162 specimens. 59 specimens contained validated Plasmodium falciparum reads. Human hepatitis B and primate erythroparvovirus-1 viral sequences were identified in fifteen and three mosquito specimens, respectively. 478 of the 1,142 specimens were found to contain bacterial reads and bacteriophage-related contigs were detected in 27 specimens. This analysis demonstrates the capacity of metagenomic approaches to elucidate important vector-host-pathogen interactions of epidemiological significance.
Collapse
Affiliation(s)
| | - Michael R Reddy
- Microsoft Premonition, Microsoft Research, Redmond, WA, 98052, USA.
| | - Xiaoji Chen
- Microsoft Premonition, Microsoft Research, Redmond, WA, 98052, USA
| | - Isaiah Hoyer
- Microsoft Premonition, Microsoft Research, Redmond, WA, 98052, USA
| | - Jack Dorman
- Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
| | - Mary E Gebhardt
- The W. Harry Feinstone Department of Molecular Microbiology and Immunology, Johns Hopkins Malaria Research Institute, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Giovanna Carpi
- Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
| | - Douglas E Norris
- The W. Harry Feinstone Department of Molecular Microbiology and Immunology, Johns Hopkins Malaria Research Institute, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - James M Pipas
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Ethan K Jackson
- Microsoft Premonition, Microsoft Research, Redmond, WA, 98052, USA
| |
Collapse
|
4
|
Fan J, Khan J, Singh NP, Pibiri GE, Patro R. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024; 19:3. [PMID: 38254124 PMCID: PMC10810250 DOI: 10.1186/s13015-024-00251-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open
Abstract
The problem of sequence identification or matching-determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence-is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto-the strongest competitor in terms of index space vs. query time trade-off-Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6[Formula: see text] faster to construct.
Collapse
Affiliation(s)
- Jason Fan
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Noor Pratap Singh
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | | | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA.
| |
Collapse
|
5
|
Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 2023; 39:i260-i269. [PMID: 37387143 DOI: 10.1093/bioinformatics/btad233] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these datasets, efficient indexing data structures-that are both scalable and provide rapid query throughput-are paramount. RESULTS Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION Themisto is available and documented as a C++ package at https://github.com/algbio/themisto available under the GPLv2 license.
Collapse
Affiliation(s)
- Jarno N Alanko
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Jaakko Vuohtoniemi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Tommi Mäklin
- Department of Mathematics and Statistics, University of Helsinki, Helsinki 00014, Finland
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| |
Collapse
|
6
|
Fan J, Singh NP, Khan J, Pibiri GE, Patro R. Fulgor: A fast and compact k-mer index for large-scale matching and color queries. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.09.539895. [PMID: 37214944 PMCID: PMC10197524 DOI: 10.1101/2023.05.09.539895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
The problem of sequence identification or matching - determining the subset of references from a given collection that are likely to contain a query nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resourceefficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an index for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2 - 6× faster to construct.
Collapse
Affiliation(s)
- Jason Fan
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
| | - Noor Pratap Singh
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
| | | | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
| |
Collapse
|
7
|
Fritschi L, Lenk K. Parameter Inference for an Astrocyte Model using Machine Learning Approaches. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.16.540982. [PMID: 37292854 PMCID: PMC10245674 DOI: 10.1101/2023.05.16.540982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Astrocytes are the largest subset of glial cells and perform structural, metabolic, and regulatory functions. They are directly involved in the communication at neuronal synapses and the maintenance of brain homeostasis. Several disorders, such as Alzheimer's, epilepsy, and schizophrenia, have been associated with astrocyte dysfunction. Computational models on various spatial levels have been proposed to aid in the understanding and research of astrocytes. The difficulty of computational astrocyte models is to fastly and precisely infer parameters. Physics informed neural networks (PINNs) use the underlying physics to infer parameters and, if necessary, dynamics that can not be observed. We have applied PINNs to estimate parameters for a computational model of an astrocytic compartment. The addition of two techniques helped with the gradient pathologies of the PINNS, the dynamic weighting of various loss components and the addition of Transformers. To overcome the issue that the neural network only learned the time dependence but did not know about eventual changes of the input stimulation to the astrocyte model, we followed an adaptation of PINNs from control theory (PINCs). In the end, we were able to infer parameters from artificial, noisy data, with stable results for the computational astrocyte model.
Collapse
Affiliation(s)
| | - Kerstin Lenk
- Institute of Neural Engineering, Graz University of Technology, Graz, Austria
- BioTechMed, 8010 Graz, Austria
| |
Collapse
|
8
|
Lai S, Pan S, Sun C, Coelho LP, Chen WH, Zhao XM. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol 2022; 23:242. [PMID: 36376928 PMCID: PMC9661791 DOI: 10.1186/s13059-022-02810-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2021] [Accepted: 11/01/2022] [Indexed: 11/16/2022] Open
Abstract
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC ( https://github.com/ZhaoXM-Lab/metaMIC ), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
Collapse
Affiliation(s)
- Senying Lai
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Shaojun Pan
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Chuqing Sun
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei China
| | - Luis Pedro Coelho
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
| | - Wei-Hua Chen
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei China
- College of Life Science, Henan Normal University, Xinxiang, Henan China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
- International Human Phenome Institutes (Shanghai), Shanghai, China
- Zhangjiang Fudan International Innovation Center, Shanghai, China
| |
Collapse
|
9
|
Vandenbogaert M, Kwasiborski A, Gonofio E, Descorps-Declère S, Selekon B, Nkili Meyong AA, Ouilibona RS, Gessain A, Manuguerra JC, Caro V, Nakoune E, Berthet N. Nanopore sequencing of a monkeypox virus strain isolated from a pustular lesion in the Central African Republic. Sci Rep 2022; 12:10768. [PMID: 35750759 PMCID: PMC9232561 DOI: 10.1038/s41598-022-15073-1] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2021] [Accepted: 06/17/2022] [Indexed: 12/16/2022] Open
Abstract
Monkeypox is an emerging and neglected zoonotic disease whose number of reported cases has been gradually increasing in Central Africa since 1980. This disease is caused by the monkeypox virus (MPXV), which belongs to the genus Orthopoxvirus in the family Poxviridae. Obtaining molecular data is particularly useful for establishing the relationships between the viral strains involved in outbreaks in countries affected by this disease. In this study, we evaluated the use of the MinION real-time sequencer as well as different polishing tools on MinION-sequenced genome for sequencing the MPXV genome originating from a pustular lesion in the context of an epidemic in a remote area of the Central African Republic. The reads corresponding to the MPXV genome were identified using two taxonomic classifiers, Kraken2 and Kaiju. Assembly of these reads led to a complete sequence of 196,956 bases, which is 6322 bases longer than the sequence previously obtained with Illumina sequencing from the same sample. The comparison of the two sequences showed mainly indels at the homopolymeric regions. However, the combined use of Canu with specific polishing tools such as Medaka and Homopolish was the best combination that reduced their numbers without adding mismatches. Although MinION sequencing is known to introduce a number of characteristic errors compared to Illumina sequencing, the new polishing tools allow a better-quality MinION-sequenced genome, thus to be used to help determine strain origin through phylogenetic analysis.
Collapse
Affiliation(s)
- Mathias Vandenbogaert
- Unité Environnement et Risque Infectieux, Cellule d'Intervention Biologique d'Urgence, Institut Pasteur, Paris, France
| | - Aurélia Kwasiborski
- Unité Environnement et Risque Infectieux, Cellule d'Intervention Biologique d'Urgence, Institut Pasteur, Paris, France
| | - Ella Gonofio
- Institut Pasteur de Bangui, Bangui, Central African Republic
| | - Stéphane Descorps-Declère
- Centre of Bioinformatics, Biostatistics and Integrative Biology (C3BI), Institut Pasteur, Paris, France
| | | | | | | | - Antoine Gessain
- Unité d'Epidémiologie et Physiopathologie des Virus Oncogènes, Département de Virologie, UMR3569, Institut Pasteur, Centre National de la Recherche Scientifique (CNRS, Paris, France
| | - Jean-Claude Manuguerra
- Unité Environnement et Risque Infectieux, Cellule d'Intervention Biologique d'Urgence, Institut Pasteur, Paris, France
| | - Valérie Caro
- Unité Environnement et Risque Infectieux, Cellule d'Intervention Biologique d'Urgence, Institut Pasteur, Paris, France
| | | | - Nicolas Berthet
- Unité Environnement et Risque Infectieux, Cellule d'Intervention Biologique d'Urgence, Institut Pasteur, Paris, France.
- The Center for Microbes, Development and Health, CAS Key Laboratory of Molecular Virology and Immunology, Institut Pasteur of Shanghai-Chinese Academy of Sciences, Discovery and Molecular Characterization of Pathogens, No. 320 Yueyang Road, XuHui District, Shanghai, 200031, China.
| |
Collapse
|
10
|
Skoufos G, Almodaresi F, Zakeri M, Paulson JN, Patro R, Hatzigeorgiou AG, Vlachos IS. AGAMEMNON: an Accurate metaGenomics And MEtatranscriptoMics quaNtificatiON analysis suite. Genome Biol 2022; 23:39. [PMID: 35101114 PMCID: PMC8802518 DOI: 10.1186/s13059-022-02610-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Accepted: 01/03/2022] [Indexed: 12/03/2022] Open
Abstract
We introduce AGAMEMNON ( https://github.com/ivlachos/agamemnon ) for the acquisition of microbial abundances from shotgun metagenomics and metatranscriptomic samples, single-microbe sequencing experiments, or sequenced host samples. AGAMEMNON delivers accurate abundances at genus, species, and strain resolution. It incorporates a time and space-efficient indexing scheme for fast pattern matching, enabling indexing and analysis of vast datasets with widely available computational resources. Host-specific modules provide exceptional accuracy for microbial abundance quantification from tissue RNA/DNA sequencing, enabling the expansion of experiments lacking metagenomic/metatranscriptomic analyses. AGAMEMNON provides an R-Shiny application, permitting performance of investigations and visualizations from a graphics interface.
Collapse
Affiliation(s)
- Giorgos Skoufos
- Department of Electrical & Computer Engineering, University of Thessaly, 38221, Volos, Greece.
- Hellenic Pasteur Institute, 11521, Athens, Greece.
- DIANA-Lab, Department of Computer Science and Biomedical Informatics, Univ. of Thessaly, 351 31, Lamia, Greece.
| | - Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Mohsen Zakeri
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Joseph N Paulson
- Department of Data Sciences, Genentech Inc., South San Francisco, CA, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Artemis G Hatzigeorgiou
- Department of Electrical & Computer Engineering, University of Thessaly, 38221, Volos, Greece.
- Hellenic Pasteur Institute, 11521, Athens, Greece.
- DIANA-Lab, Department of Computer Science and Biomedical Informatics, Univ. of Thessaly, 351 31, Lamia, Greece.
| | - Ioannis S Vlachos
- Cancer Research Institute | HMS Initiative for RNA Medicine | Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, 02115, USA.
- Spatial Technologies Unit, Beth Israel Deaconess Medical Center, MA, Boston, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
| |
Collapse
|
11
|
Almodaresi F, Zakeri M, Patro R. Puffaligner : A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index. Bioinformatics 2021; 37:4048-4055. [PMID: 34117875 PMCID: PMC9502150 DOI: 10.1093/bioinformatics/btab408] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2020] [Revised: 04/30/2021] [Accepted: 06/11/2021] [Indexed: 12/22/2022] Open
Abstract
Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found with 10.5281/zenodo.4902332. PuffAligner is a free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Mohsen Zakeri
- Computer Science Department, University of Maryland, College Park, USA
| | - Rob Patro
- Computer Science Department, University of Maryland, College Park, USA
| |
Collapse
|
12
|
Zhao H, Wang S, Yuan X. Detection of Pathogenic Microbe Composition Using Next-Generation Sequencing Data. Front Genet 2020; 11:603093. [PMID: 33329748 PMCID: PMC7734255 DOI: 10.3389/fgene.2020.603093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2020] [Accepted: 10/21/2020] [Indexed: 11/23/2022] Open
Abstract
Next-generation sequencing (NGS) technologies have provided great opportunities to analyze pathogenic microbes with high-resolution data. The main goal is to accurately detect microbial composition and abundances in a sample. However, high similarity among sequences from different species and the existence of sequencing errors pose various challenges. Numerous methods have been developed for quantifying microbial composition and abundance, but they are not versatile enough for the analysis of samples with mixtures of noise. In this paper, we propose a new computational method, PGMicroD, for the detection of pathogenic microbial composition in a sample using NGS data. The method first filters the potentially mistakenly mapped reads and extracts multiple species-related features from the sequencing reads of 16S rRNA. Then it trains an Support Vector Machine classifier to predict the microbial composition. Finally, it groups all multiple-mapped sequencing reads into the references of the predicted species to estimate the abundance for each kind of species. The performance of PGMicroD is evaluated based on both simulation and real sequencing data and is compared with several existing methods. The results demonstrate that our proposed method achieves superior performance. The software package of PGMicroD is available at https://github.com/BDanalysis/PGMicroD.
Collapse
Affiliation(s)
- Haiyong Zhao
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China.,School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Shuang Wang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
13
|
LaPierre N, Alser M, Eskin E, Koslicki D, Mangul S. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biol 2020; 21:242. [PMID: 32912225 PMCID: PMC7488264 DOI: 10.1186/s13059-020-02159-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Accepted: 08/26/2020] [Indexed: 12/31/2022] Open
Abstract
Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
Collapse
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California, Los Angeles, CA, 90095, USA.
| | - Mohammed Alser
- Department of Computer Science, ETH Zurich, Rämistrasse 101, CH-8092, Zurich, Switzerland
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA, 90095, USA
- Department of Computational Medicine, University of California, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, CA, 90095, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
- Department of Biology, The Pennsylvania State University, University Park, PA, USA.
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park,, PA, USA.
| | - Serghei Mangul
- Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
| |
Collapse
|
14
|
Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 2019; 178:779-794. [PMID: 31398336 PMCID: PMC6716367 DOI: 10.1016/j.cell.2019.07.010] [Citation(s) in RCA: 264] [Impact Index Per Article: 52.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2019] [Revised: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 01/17/2023]
Abstract
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.
Collapse
Affiliation(s)
- Simon H Ye
- Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Katherine J Siddle
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
| | - Daniel J Park
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Pardis C Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Disease, Harvard School of Public Health, Boston, MA 02115, USA; Howard Hughes Medical Institute (HHMI), Chevy Chase, MD 20815, USA
| |
Collapse
|