1
Marini S, Barquero A, Wadhwani AA, Bian J, Ruiz J, Boucher C, Prosperi M. OCTOPUS: Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier. bioRxiv: the preprint server for biology 2024:2024.03.15.585215. [PMID: 38559026 PMCID: PMC10979967 DOI: 10.1101/2024.03.15.585215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Portable genomic sequencers such as Oxford Nanopore's MinION enable real-time applications in clinical and environmental health. However, there is a bottleneck in the downstream analytics when bioinformatics pipelines are unavailable, e.g., when cloud processing is unreachable due to the absence of an Internet connection, or when only low-end computing devices can be carried on site. Here we present a platform-friendly software for portable metagenomic analysis of Nanopore data, the Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS). OCTOPUS is written in Java and reimplements several features of the popular Kraken2 and KrakenUniq software, with original components for improving metagenomics classification on incomplete/sampled reference databases, making it ideal for running on smartphones or tablets. OCTOPUS obtains sensitivity and precision comparable to Kraken2, while dramatically decreasing (4- to 16-fold) the false positive rate, and yielding high correlation on real-world data. OCTOPUS is available along with customized databases at https://github.com/DataIntellSystLab/OCTOPUS and https://github.com/Ruiz-HCI-Lab/OctopusMobile.
Affiliation(s)
- Simone Marini
  - Department of Epidemiology, University of Florida, Gainesville, USA
  - Emerging Pathogens Institute, University of Florida, Gainesville, USA
- Alexander Barquero
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Anisha Ashok Wadhwani
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, USA
- Jaime Ruiz
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Mattia Prosperi
  - Department of Epidemiology, University of Florida, Gainesville, USA

2
Jing K, Xu Y, Yang Y, Yin P, Ning D, Huang G, Deng Y, Chen G, Li G, Tian SZ, Zheng M. ScSmOP: a universal computational pipeline for single-cell single-molecule multiomics data analysis. Brief Bioinform 2023; 24:bbad343. [PMID: 37779245 DOI: 10.1093/bib/bbad343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 06/24/2023] [Accepted: 09/10/2023] [Indexed: 10/03/2023]
Abstract
Single-cell multiomics techniques have been widely applied to detect the key signature of cells. These methods have achieved a single-molecule resolution and can even reveal spatial localization. These emerging methods provide insights elucidating the features of genomic, epigenomic and transcriptomic heterogeneity in individual cells. However, they have given rise to new computational challenges in data processing. Here, we describe Single-cell Single-molecule multiple Omics Pipeline (ScSmOP), a universal pipeline for barcode-indexed single-cell single-molecule multiomics data analysis. Essentially, the C language is utilized in ScSmOP to set up spaced-seed hash table-based algorithms for barcode identification according to ligation-based barcoding data and synthesis-based barcoding data, followed by data mapping and deconvolution. We demonstrate high reproducibility of data processing between ScSmOP and published pipelines in comprehensive analyses of single-cell omics data (scRNA-seq, scATAC-seq, scARC-seq), single-molecule chromatin interaction data (ChIA-Drop, SPRITE, RD-SPRITE), single-cell single-molecule chromatin interaction data (scSPRITE) and spatial transcriptomic data from various cell types and species. Additionally, ScSmOP shows more rapid performance and is a versatile, efficient, easy-to-use and robust pipeline for single-cell single-molecule multiomics data analysis.
Affiliation(s)
- Kai Jing
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yewen Xu
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yang Yang
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Pengfei Yin
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Duo Ning
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guangyu Huang
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yuqing Deng
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Gengzhan Chen
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guoliang Li
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Simon Zhongyuan Tian
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Meizhen Zheng
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China

3
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023]
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Affiliation(s)
- Can Firtina
  - ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Haiyu Mao
  - ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Onur Mutlu
  - ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland

4
Mallik A, Ilie L. ALeS: adaptive-length spaced-seed design. Bioinformatics 2021; 37:1206-1210. [PMID: 34107042 DOI: 10.1093/bioinformatics/btaa945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2020] [Revised: 09/26/2020] [Accepted: 10/27/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Sequence similarity is the most frequently used procedure in biological research, as proved by the widely used BLAST program. The consecutive seed used by BLAST can be dramatically improved by considering multiple spaced seeds. Finding the best seeds is a hard problem and much effort went into developing heuristic algorithms and software for designing highly sensitive spaced seeds. RESULTS: We introduce a new algorithm and software, ALeS, that produces more sensitive seeds than the current state-of-the-art programs, as shown by extensive testing. We also accurately estimate the sensitivity of a seed, enabling its computation for arbitrary seeds. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at github.com/lucian-ilie/ALeS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
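To make the notion of seed sensitivity concrete, the following Python sketch computes it by brute force for very short regions. It is not the ALeS algorithm or its estimator. The function name `seed_sensitivity`, the example mask, and the Bernoulli model (each aligned position matches independently with probability `match_prob`) are assumptions made for illustration only.

```python
from itertools import product


def seed_sensitivity(mask, region_length, match_prob):
    """Probability that a seed hits a random region (illustrative model).

    mask: string of '1' (position must match) and '0' (don't care).
    A region is a sequence of independent match/mismatch outcomes;
    the seed "hits" if, at some offset, every '1' position is a match.
    """
    care = [i for i, bit in enumerate(mask) if bit == "1"]
    offsets = range(region_length - len(mask) + 1)
    total = 0.0
    # Enumerate all 2**region_length match (1) / mismatch (0) patterns.
    for region in product((0, 1), repeat=region_length):
        hit = any(all(region[s + i] for i in care) for s in offsets)
        if hit:
            matches = sum(region)
            total += (match_prob ** matches
                      * (1 - match_prob) ** (region_length - matches))
    return total


# With a region exactly as long as the seed, the sensitivity is
# match_prob raised to the seed weight (here 0.5 ** 3).
print(seed_sensitivity("1101", 4, 0.5))  # 0.125
```

Enumeration costs 2^`region_length` steps, so this is only viable for toy sizes. It is meant to define the quantity that seed-design software optimizes, not to stand in for such software.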
Affiliation(s)
- Arnab Mallik
  - Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
- Lucian Ilie
  - Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada

5
New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl 2021; 33:15669-15692. [PMID: 34155424 PMCID: PMC8208613 DOI: 10.1007/s00521-021-06188-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/27/2020] [Accepted: 06/02/2021] [Indexed: 12/13/2022]
Abstract
During the last 15 years, improved omics sequencing technologies have expanded the scale and resolution of various biological applications, generating high-throughput datasets that require carefully chosen software tools to be processed. Following these sequencing developments, bioinformatics researchers have been challenged to implement alignment algorithms for next-generation sequencing reads. However, the selection of aligners based on genome characteristics is still poorly studied, so our benchmarking study extended the state of the art by comparing 17 different aligners. The chosen tools were assessed on empirical human DNA- and RNA-Seq data, as well as on simulated datasets in human and mouse, evaluating a set of parameters not previously considered in this kind of benchmark. As expected, we found that each tool was the best in specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and BWA-MEM, which reached the best results in terms of efficiency, accuracy, duplication rate, saturation profile and running time. For Illumina paired-end osteomyelitis transcriptomics data, instead, the best performer, together with the already cited CLC, was Novoalign, which excelled in accuracy and saturation analyses. Segemehl and DNASTAR performed the best on both DNA-Seq data, with Segemehl particularly suitable for exome data. In conclusion, our study could guide users in the selection of a suitable aligner based on genome and transcriptome characteristics. However, several other aspects that emerged from our work should be considered in the evolution of the alignment research area, such as the involvement of artificial intelligence to support cloud computing and mapping to multiple genomes.
6
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023]
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Affiliation(s)
- Thomas Dencker
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Chris-André Leimeister
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Michael Gerth
  - Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
- Christoph Bleidorn
  - Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
  - Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
- Sagi Snir
  - Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
  - Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany

7
Petrucci E, Noé L, Pizzi C, Comin M. Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing. J Comput Biol 2020; 27:223-233. [PMID: 31800307 DOI: 10.1089/cmb.2019.0298] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
Abstract
Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash tables and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They improve sensitivity and are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed, and thus their usage substantially increases the computational time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in the range of 3.5× to 7×, outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.
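The difference between a contiguous k-mer and a spaced seed can be illustrated with a naive Python sketch. This is not the iterative algorithm proposed in the article: it recomputes every hash from scratch, which is exactly the cost that work aims to reduce. The function names, the 2-bit encoding, the mask and the toy reads are all assumptions chosen for illustration.

```python
# 2-bit encoding of nucleotides (an illustrative choice of hash).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, mask):
    """Hash every window of seq, reading only the '1' positions of mask."""
    care = [i for i, bit in enumerate(mask) if bit == "1"]
    hashes = []
    for start in range(len(seq) - len(mask) + 1):
        value = 0
        for offset in care:
            # Append two bits per cared-for nucleotide.
            value = (value << 2) | ENCODE[seq[start + offset]]
        hashes.append(value)
    return hashes


def matching_windows(a, b, mask):
    """Count window positions where both sequences yield the same hash."""
    return sum(x == y for x, y in zip(spaced_seed_hashes(a, mask),
                                      spaced_seed_hashes(b, mask)))


read = "ACGTACGTAC"
mutated = "ACGTTCGTAC"  # single substitution at index 4

print(matching_windows(read, mutated, "1111111"))  # contiguous 7-mer: 0
print(matching_windows(read, mutated, "1101011"))  # spaced seed, same span: 2
```

With the contiguous mask, every window covers the substituted base, so no hash survives. The spaced mask skips that base in two of the four windows, so those two hashes still agree. Note that the two masks have the same span but different weights (7 versus 5), so this shows the tolerance mechanism rather than a like-for-like sensitivity comparison.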
Affiliation(s)
- Enrico Petrucci
  - Department of Information Engineering, University of Padova, Padova, Italy
- Laurent Noé
  - CRIStAL UMR9189, Université de Lille, Lille, France
- Cinzia Pizzi
  - Department of Information Engineering, University of Padova, Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, Padova, Italy

8
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. [DOI: 10.1142/s0219720019400110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
Affiliation(s)
- Yoshihiro Shibuya
  - Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
  - Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
  - Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy