1
|
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filtrate errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF_IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome. CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
Collapse
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| | - Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
| | - M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| |
Collapse
|
2
|
Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 PMCID: PMC11222498 DOI: 10.1038/s41598-024-52386-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024] Open
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure an impact of the correction on variant calling (VC) as well as de novo assembly. It shows, that in selected cases read correction improves the VC results quality. We also examine the algorithms behaviour in a processing of Illumina NovaSeq reads, with different reads quality characteristics than in older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping with a new correction strategy. Currently, RECKONER allows to correct high-coverage human reads in less than 2.5 h, is able to cope with two types of reads errors: indels and substitutions, and utilizes a new, based on a two lengths of oligomers, correction verification technique.
Collapse
Affiliation(s)
- Maciej Długosz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
| | - Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland.
| |
Collapse
|
3
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
|
4
|
Schmeing S, Robinson MD. ReSeq simulates realistic Illumina high-throughput sequencing data. Genome Biol 2021; 22:67. [PMID: 33608040 PMCID: PMC7896392 DOI: 10.1186/s13059-021-02265-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Accepted: 01/07/2021] [Indexed: 12/18/2022] Open
Abstract
In high-throughput sequencing data, performance comparisons between computational tools are essential for making informed decisions at each step of a project. Simulations are a critical part of method comparisons, but for standard Illumina sequencing of genomic DNA, they are often oversimplified, which leads to optimistic results for most tools. ReSeq improves the authenticity of synthetic data by extracting and reproducing key components from real data. Major advancements are the inclusion of systematic errors, a fragment-based coverage model and sampling-matrix estimates based on two-dimensional margins. These improvements lead to more faithful performance evaluations. ReSeq is available at https://github.com/schmeing/ReSeq.
Collapse
Affiliation(s)
- Stephan Schmeing
- Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland. .,SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland.
| | - Mark D Robinson
- Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland. .,SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland.
| |
Collapse
|
5
|
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
Collapse
Affiliation(s)
- Keith Mitchell
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Igor Mandric
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Qiaozhen Wu
- Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Sei Chang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Aaron Karlsberg
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Ekaterina Gerasimov
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Russell Littman
- UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Kevin Hsieh
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Linus Chen
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Eli Littman
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Taylor Shabani
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - German Enik
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Douglas Yao
- Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Jan Schroeder
- Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M, Sechenov First Moscow State Medical University, Moscow, Russia, 119991
| | - Pavel Skums
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
| |
Collapse
|
6
|
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models. Sci Rep 2019; 9:16157. [PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2019] [Accepted: 10/07/2019] [Indexed: 01/30/2023] Open
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of its configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer. The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions—for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models With best parameter selection by Athena, the assembly quality (NG50) is improved by a Geometric Mean of 4.72X across the 7 real datasets.
Collapse
|
7
|
Fischer-Hwang I, Ochoa I, Weissman T, Hernaez M. Denoising of Aligned Genomic Data. Sci Rep 2019; 9:15067. [PMID: 31636330 PMCID: PMC6803637 DOI: 10.1038/s41598-019-51418-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Accepted: 09/25/2019] [Indexed: 12/30/2022] Open
Abstract
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .
Collapse
Affiliation(s)
- Irena Fischer-Hwang
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA.
| | - Idoia Ochoa
- University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, 61801, USA
| | - Tsachy Weissman
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA
| | - Mikel Hernaez
- University of Illinois Urbana-Champaign, Carl R. Woese Institute for Genomic Biology, Urbana, 61801, USA.
| |
Collapse
|
8
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022] Open
Abstract
Background Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. Results We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. Conclusions BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Mahdi Heydari
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Giles Miclotte
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Yves Van de Peer
- Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.,Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium.,Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium.,Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
| | - Jan Fostier
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium. .,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.
| |
Collapse
|
9
|
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets. CONCLUSION The z-score is adequate to distinguish solid k-mers from weak k-mers, particularly useful for pinpointing out solid k-mers having very low frequency. Applying z-score on k-mer can markedly improve the error correction accuracy.
Collapse
Affiliation(s)
- Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jin Xie
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Lin Bai
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wen Chen
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Mingju Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhonglei Zhang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Yiqi Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhe Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
| |
Collapse
|
10
|
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018; 7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022] Open
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets.We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
Collapse
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| | - Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| |
Collapse
|
11
|
Savel D, LaFramboise T, Grama A, Koyuturk M. Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1378-1388. [PMID: 27362987 PMCID: PMC5754272 DOI: 10.1109/tcbb.2016.2586060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Next generation sequencing technologies enable efficient and cost-effective genome sequencing. However, sequencing errors increase the complexity of the de novo assembly process, and reduce the quality of the assembled sequences. Many error correction techniques utilizing substring frequencies have been developed to mitigate this effect. In this paper, we present a novel and effective method called Pluribus, for correcting sequencing errors using a generalized suffix trie. Pluribus utilizes multiple manifestations of an error in the trie to accurately identify errors and suggest corrections. We show that Pluribus produces the least number of false positives across a diverse set of real sequencing datasets when compared to other methods. Furthermore, Pluribus can be used in conjunction with other contemporary error correction methods to achieve higher levels of accuracy than either tool alone. These increases in error correction accuracy are also realized in the quality of the contigs that are generated during assembly. We explore, in-depth, the behavior of Pluribus , to explain the observed improvement in accuracy and assembly performance. Pluribus is freely available at http://compbio. CASE edu/pluribus/.
Collapse
|
12
|
Large-scale comparative metagenomics of Blastocystis, a common member of the human gut microbiome. ISME JOURNAL 2017; 11:2848-2863. [PMID: 28837129 PMCID: PMC5702742 DOI: 10.1038/ismej.2017.139] [Citation(s) in RCA: 117] [Impact Index Per Article: 16.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/08/2017] [Revised: 07/05/2017] [Accepted: 07/14/2017] [Indexed: 12/12/2022]
Abstract
The influence of unicellular eukaryotic microorganisms on human gut health and disease is still largely unexplored. Blastocystis spp. commonly colonize the gut, but its clinical significance and ecological role are currently unsettled. We have developed a high-sensitivity bioinformatic pipeline to detect Blastocystis subtypes (STs) from shotgun metagenomics, and applied it to 12 large data sets, comprising 1689 subjects of different geographic origin, disease status and lifestyle. We confirmed and extended previous observations on the high prevalence the microrganism in the population (14.9%), its non-random and ST-specific distribution, and its ability to cause persistent (asymptomatic) colonization. These findings, along with the higher prevalence observed in non-westernized individuals, the lack of positive association with any of the disease considered, and decreased presence in individuals with dysbiosis associated with colorectal cancer and Crohn's disease, strongly suggest that Blastocystis is a component of the healthy gut microbiome. Further, we found an inverse association between body mass index and Blastocystis, and strong co-occurrence with archaeal organisms (Methanobrevibacter smithii) and several bacterial species. The association of specific microbial community structures with Blastocystis was confirmed by the high predictability (up to 0.91 area under the curve) of the microorganism colonization based on the species-level composition of the microbiome. Finally, we reconstructed and functionally profiled 43 new draft Blastocystis genomes and discovered a higher intra subtype variability of ST1 and ST2 compared with ST3 and ST4. Altogether, we provide an in-depth epidemiologic, ecological, and genomic analysis of Blastocystis, and show how metagenomics can be crucial to advance population genomics of human parasites.
Collapse
|
13
|
Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 2017; 18:374. [PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 08/11/2017] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
Collapse
|
14
|
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Liang Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Qingfeng Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wencui Li
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Peng Jiang
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Jinyan Li
- Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
| |
Collapse
|
15
|
Hershberg U, Luning Prak ET. The analysis of clonal expansions in normal and autoimmune B cell repertoires. Philos Trans R Soc Lond B Biol Sci 2016; 370:rstb.2014.0239. [PMID: 26194753 PMCID: PMC4528416 DOI: 10.1098/rstb.2014.0239] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Clones are the fundamental building blocks of immune repertoires. The number of different clones relates to the diversity of the repertoire, whereas their size and sequence diversity are linked to selective pressures. Selective pressures act both between clones and within different sequence variants of a clone. Understanding how clonal selection shapes the immune repertoire is one of the most basic questions in all of immunology. But how are individual clones defined? Here we discuss different approaches for defining clones, starting with how antibodies are diversified during different stages of B cell development. Next, we discuss how clones are defined using different experimental methods. We focus on high-throughput sequencing datasets, and the computational challenges and opportunities that these data have for mining the antibody repertoire landscape. We discuss methods that visualize sequence variants within the same clone and allow us to consider collections of shared mutations to determine which sequences share a common ancestry. Finally, we comment on features of frequently encountered expanded B cell clones that may be of particular interest in the setting of autoimmunity and other chronic conditions.
Collapse
Affiliation(s)
- Uri Hershberg
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Bossone 7-711, 3141 Chestnut Street, Philadelphia, PA 19104, USA Department of Immunology and Microbiology, College of Medicine, Drexel University, Bossone 7-711, 3141 Chestnut Street, Philadelphia, PA 19104, USA
| | - Eline T Luning Prak
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, 405B Stellar Chance Labs, 422 Curie Boulevard, Philadelphia, PA 19104, USA
| |
Collapse
|
16
|
Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
Collapse
Affiliation(s)
- Katrin Sameith
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Juliana G Roscito
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Michael Hiller
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
| |
Collapse
|
17
|
Tong L, Yang C, Wu PY, Wang MD. Evaluating the impact of sequencing error correction for RNA-seq data with ERCC RNA spike-in controls. ... IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS. IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS 2016; 2016:74-77. [PMID: 27532064 DOI: 10.1109/bhi.2016.7455838] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Sequencing errors are a major issue for several next-generation sequencing-based applications such as de novo assembly and single nucleotide polymorphism detection. Several error-correction methods have been developed to improve raw data quality. However, error-correction performance is hard to evaluate because of the lack of a ground truth. In this study, we propose a novel approach which using ERCC RNA spike-in controls as the ground truth to facilitate error-correction performance evaluation. After aligning raw and corrected RNA-seq data, we characterized the quality of reads by three metrics: mismatch patterns (i.e., the substitution rate of A to C) of reads aligned with one mismatch, mismatch patterns of reads aligned with two mismatches and the percentage increase of reads aligned to reference. We observed that the mismatch patterns for reads aligned with one mismatch are significantly correlated between ERCC spike-ins and real RNA samples. Based on such observations, we conclude that ERCC spike-ins can serve as ground truths for error correction beyond their previous applications for validation of dynamic range and fold-change response. Also, the mismatch patterns for ERCC reads aligned with one mismatch can serve as a novel and reliable metric to evaluate the performance of error-correction tools.
Collapse
Affiliation(s)
- Li Tong
- Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
| | - Cheng Yang
- Dept. of Biomedical Engineering, Peking University, No.5 Yiheyuan Road Haidian District, Beijing, P.R. China 100871
| | - Po-Yen Wu
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - May D Wang
- Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
| |
Collapse
|
18
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novostand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
| | - David Ruzafa
- Departamento de Quìmica Fìsica e Instituto de Biotecnologìa, Facultad de Ciencias; Universidad de Granada; Granada Spain
| | - Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
| | - Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
| |
Collapse
|
19
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 180] [Impact Index Per Article: 22.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Collapse
|
20
|
Kowalski T, Grabowski S, Deorowicz S. Indexing Arbitrary-Length k-Mers in Sequencing Reads. PLoS One 2015; 10:e0133198. [PMID: 26182400 PMCID: PMC4504488 DOI: 10.1371/journal.pone.0133198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Accepted: 06/24/2015] [Indexed: 11/25/2022] Open
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
Collapse
Affiliation(s)
- Tomasz Kowalski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| |
Collapse
|
21
|
Marinier E, Brown DG, McConkey BJ. Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 2015; 16:10. [PMID: 25592313 PMCID: PMC4307147 DOI: 10.1186/s12859-014-0435-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2014] [Accepted: 12/17/2014] [Indexed: 12/13/2022] Open
Abstract
Background Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads. Results We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times more rare than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software. Conclusions Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.
Collapse
Affiliation(s)
- Eric Marinier
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
| | - Daniel G Brown
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
| | - Brendan J McConkey
- Department of Biology, University of Waterloo, 200 University Ave W, N2L3G1 Waterloo, Canada.
| |
Collapse
|