1
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022]
Abstract
Next-generation sequencing (NGS) has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods exploit the overlaps between reads, a concern is whether the sequencing errors that degrade genome assemblies also hurt the compression of NGS data. This work addresses two problems: how current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that file reordering contributes a further gain. Error correction of the original reads also improved the quality of the genome assemblies, sometimes remarkably. However, it remains an open question how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved.
Affiliation(s)
- Tao Tang
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
  - School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
  - School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
2
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, a sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genome. With the recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools. Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented on top of two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, both in a distributed big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data. Conclusions: The excellent performance of BigFiRSt is mainly attributable to Hadoop, which lets it merge read pairs and mine SSRs in parallel on distributed clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
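To make the SSR-mining step concrete, the following is a minimal single-machine sketch of finding perfect tandem repeats in one (already merged) read. It is an illustration only and is not the algorithm used by PERF or BigPERF; the function name `find_ssrs` and all thresholds (`max_motif`, `min_repeats`, `min_length`) are assumptions chosen for the example.

```python
import re


def find_ssrs(seq, min_motif=1, max_motif=6, min_repeats=3, min_length=12):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper: thresholds are illustrative defaults, not PERF's.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-base motif followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT"),
            # because the shorter motif already reports the same locus.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            length = m.end() - m.start()
            if length >= min_length:
                hits.append((m.start(), m.end(), motif, length // k))
    return sorted(hits)


if __name__ == "__main__":
    read = "GGC" + "AT" * 8 + "CCG"
    for start, end, motif, repeats in find_ssrs(read):
        print(start, end, motif, repeats)
```

In a distributed setting such as the one the abstract describes, a per-read function of this kind would be the unit of work applied to each read in parallel; how BigPERF actually partitions the data is not specified here.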
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Correspondence: Jiangning Song
3
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates because they break reads into independent k-mers, or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly, it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
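The role of minhashing in the similarity search can be illustrated with a short sketch. This is not CARE's implementation (which is C++/CUDA); the function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for the example. Reads that share a minimum hash value under the same hash function land in the same bucket and become candidates for alignment.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Set of all k-mers in a read (read must be at least k bases long)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def minhash_signature(read, k=8, num_hashes=16):
    """One minimum hash value per salted hash function (hypothetical parameters)."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for km in kmers(read, k)))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates k-mer set similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(reads, k=8, num_hashes=16):
    """Index pairs of reads that share at least one (hash function, min value) bucket."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for h, value in enumerate(minhash_signature(read, k, num_hashes)):
            buckets[(h, value)].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

Only the candidate pairs would then be passed to the (much more expensive) alignment step, which is what makes the approach scale; the alignment and correction logic itself is outside the scope of this sketch.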
Affiliation(s)
- Felix Kallenborn
  - Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
- Andreas Hildebrandt
  - Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
4
GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments. BMC Bioinformatics 2021; 22:220. [PMID: 33926379 PMCID: PMC8082839 DOI: 10.1186/s12859-021-04133-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/17/2020] [Accepted: 04/14/2021] [Indexed: 11/10/2022]
Abstract
Background: Design of valid high-quality primers is essential for qPCR experiments. MRPrimer is a powerful pipeline based on MapReduce that combines both primer design for target sequences and homology tests on off-target sequences. It takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB. Owing to the effectiveness of primers designed by MRPrimer in qPCR analysis, it has been widely used for developing online design tools and building primer databases. However, MRPrimer is too slow to cope with exponentially growing sequence DBs and thus must be improved. Results: We develop a fast GPU-based pipeline for primer design (GPrimer) that takes the same input and returns the same output as MRPrimer. MRPrimer consists of seven MapReduce steps, two of which are very time-consuming. GPrimer significantly speeds up those two steps by exploiting the computational power of GPUs. In particular, it designs data structures for coalesced memory access on the GPU and for workload balancing among GPU threads, and copies the data structures between main memory and GPU memory in a streaming fashion. For the human RefSeq DB, GPrimer achieves a speedup of 57 times over all steps and a speedup of 557 times for the most time-consuming step using a single machine with 4 GPUs, compared with MRPrimer running on a cluster of six machines. Conclusions: We propose a GPU-based pipeline for primer design that takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB at once, without an additional step using BLAST-like tools. The software is available at https://github.com/qhtjrmin/GPrimer.git.
5
Koessler T, Paradiso V, Piscuoglio S, Nienhold R, Ho L, Christinat Y, Terracciano LM, Cathomas G, Wicki A, McKee TA, Nouspikel T. Reliability of liquid biopsy analysis: an inter-laboratory comparison of circulating tumor DNA extraction and sequencing with different platforms. Lab Invest 2020; 100:1475-1484. [PMID: 32616816 DOI: 10.1038/s41374-020-0459-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 04/16/2020] [Revised: 06/19/2020] [Accepted: 06/19/2020] [Indexed: 01/11/2023]
Abstract
Liquid biopsy, the analysis of circulating tumor DNA (ctDNA), is a promising tool in oncology, especially in personalized medicine. Although its main applications currently focus on selection and adjustment of therapy, ctDNA may also be used to monitor residual disease, establish prognosis, detect relapses, and possibly screen at-risk individuals. CtDNA represents a small and variable proportion of circulating cell-free DNA (ccfDNA) which is itself present at a low concentration in normal individuals and so analyzing ctDNA is technically challenging. Various commercial systems have recently appeared on the market, but it remains difficult for practitioners to compare their performance and to determine whether they yield comparable results. As a first step toward establishing national guidelines for ctDNA analyses, four laboratories in Switzerland joined a comparative exercise to assess ccfDNA extraction and ctDNA analysis by sequencing. Extraction was performed using six distinct methods and yielded ccfDNA of equally high quality, suitable for sequencing. Sequencing of synthetic samples containing predefined amounts of eight mutations was performed on three different systems, with similar results. In all four laboratories, mutations were easily identified down to 1% allele frequency, whereas detection at 0.1% proved challenging. Linearity was excellent in all cases and while molecular yield was superior with one system this did not impact on sensitivity. This study also led to several additional conclusions: First, national guidelines should concentrate on principles of good laboratory practice rather than recommend a particular system. Second, it is essential that laboratories thoroughly validate every aspect of extraction and sequencing, in particular with respect to initial amount of DNA and average sequencing depth. 
Finally, as software proved critical for mutation detection, laboratories should validate the performance of variant callers and underlying algorithms with respect to various types of mutations.
Affiliation(s)
- Thibaud Koessler
  - Department of Oncology, Geneva University Hospitals, Geneva, Switzerland
- Viola Paradiso
  - Institute of Medical Genetics and Pathology, University Hospital Basel, Basel, Switzerland
- Salvatore Piscuoglio
  - Institute of Medical Genetics and Pathology, University Hospital Basel, Basel, Switzerland
  - Visceral surgery research laboratory, Department of Biomedicine, University of Basel, Basel, Switzerland
- Ronny Nienhold
  - Institute of Pathology, Cantonal Hospital Basel-Land, Liestal, Switzerland
- Liza Ho
  - Clinical Pathology Service, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Yann Christinat
  - Clinical Pathology Service, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Luigi M Terracciano
  - Institute of Medical Genetics and Pathology, University Hospital Basel, Basel, Switzerland
- Gieri Cathomas
  - Visceral surgery research laboratory, Department of Biomedicine, University of Basel, Basel, Switzerland
- Andreas Wicki
  - Department of Oncology & Hematology, Medical University Clinic, Cantonal Hospital Basel-Land, Liestal, and University of Basel, Basel, Switzerland
- Thomas A McKee
  - Clinical Pathology Service, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Thierry Nouspikel
  - Medical Genetics, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
6
Jiang P, Luo J, Wang Y, Deng P, Schmidt B, Tang X, Chen N, Wong L, Zhao L. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2020; 35:4871-4878. [PMID: 31038666 DOI: 10.1093/bioinformatics/btz299] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/29/2018] [Revised: 04/02/2019] [Accepted: 04/19/2019] [Indexed: 12/25/2022]
Abstract
MOTIVATION: K-mers along with their frequencies have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters is itself large; very often it is too large to fit into main memory, which severely limits its usability. RESULTS: We introduce a novel idea of encoding k-mers together with their frequencies, achieving good memory savings and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in coupled bit arrays, one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 compared with the raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. AVAILABILITY AND IMPLEMENTATION: The source code of our algorithm is available at github.com/lzhLab/kmcEx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
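The idea of coupling a membership array with a frequency array can be sketched as follows. This is a simplified stand-in and not kmcEx's actual encoding: the class name `CoupledBloomCounter`, the one-byte-per-slot frequency array, the cap at 255, and the max-on-insert/min-on-query rule are all assumptions made for illustration. Only the use of 7 hash functions is taken from the abstract.

```python
import hashlib


class CoupledBloomCounter:
    """Bloom filter-like structure with a coupled frequency array (illustrative only)."""

    def __init__(self, num_slots=1 << 16, num_hashes=7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # Array 1: one bit per slot, records k-mer membership.
        self.present = bytearray((num_slots + 7) // 8)
        # Array 2: one capped counter per slot; a real encoding would bit-pack this.
        self.counts = bytearray(num_slots)

    def _positions(self, kmer):
        # Double hashing: derive num_hashes slot indices from one 128-bit digest.
        digest = hashlib.blake2b(kmer.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, kmer, count):
        count = min(count, 255)  # counters saturate at one byte
        for pos in self._positions(kmer):
            self.present[pos >> 3] |= 1 << (pos & 7)
            self.counts[pos] = max(self.counts[pos], count)

    def query(self, kmer):
        """Return 0 if the k-mer is definitely absent, else an upper bound on its count."""
        positions = self._positions(kmer)
        if not all(self.present[p >> 3] & (1 << (p & 7)) for p in positions):
            return 0
        return min(self.counts[p] for p in positions)
```

Lookups touch a fixed number of slots, which is why retrieval time is effectively constant. As with any Bloom filter, an absent k-mer can occasionally be reported as present, and colliding slots can inflate a reported count; how kmcEx reduces that false-positive rate is not reproduced here.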
Affiliation(s)
- Peng Jiang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Jie Luo
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Yiqi Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Pingji Deng
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Bertil Schmidt
  - Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Xiangjun Tang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Ningjiang Chen
  - School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore, Singapore
- Liang Zhao
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
  - School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
7
Yoest JM, Shirai CL, Duncavage EJ. Sequencing-Based Measurable Residual Disease Testing in Acute Myeloid Leukemia. Front Cell Dev Biol 2020; 8:249. [PMID: 32457898 PMCID: PMC7225302 DOI: 10.3389/fcell.2020.00249] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/09/2019] [Accepted: 03/24/2020] [Indexed: 12/31/2022]
Abstract
Next generation sequencing (NGS) methods have allowed for unprecedented genomic characterization of acute myeloid leukemia (AML) over the last several years. Further advances in NGS-based methods including error correction using unique molecular identifiers (UMIs) have more recently enabled the use of NGS-based measurable residual disease (MRD) detection. This review focuses on the use of NGS-based MRD detection in AML, including basic methodologies and clinical applications.
Affiliation(s)
- Jennifer M Yoest
  - Department of Pathology, Case Western Reserve University, Cleveland, OH, United States
- Cara Lunn Shirai
  - Department of Pathology and Immunology, Washington University in St. Louis, St. Louis, MO, United States
- Eric J Duncavage
  - Department of Pathology and Immunology, Washington University in St. Louis, St. Louis, MO, United States
8
Jiang P, Hu Y, Wang Y, Zhang J, Zhu Q, Bai L, Tong Q, Li T, Zhao L. Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study. Front Genet 2019; 10:670. [PMID: 31440271 PMCID: PMC6694746 DOI: 10.3389/fgene.2019.00670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2018] [Accepted: 06/27/2019] [Indexed: 11/28/2022]
Abstract
Ventricular septal defect (VSD) is a fatal congenital heart disease with severe consequences for affected infants. Early diagnosis plays an important role, particularly through genetic variants. Existing panel-based approaches to variant mining suffer from a shortage of large panels, costly sequencing, and missed rare variants. Although a trio-based method alleviates these limitations to some extent, it is agnostic to novel mutations and computationally intensive. Considering these limitations, we study a novel variant-mining algorithm for trio-based sequencing data and apply it to a VSD trio to identify associated mutations. Our approach starts by filtering irrelevant k-mers from the sequences of a trio via a newly conceived coupled Bloom filter, then corrects sequencing errors using a statistical approach and extends the retained k-mers into long sequences. These extended sequences are then used as input for identifying variants. The obtained variants are subsequently analyzed comprehensively against existing databases to mine VSD-related mutations. Experiments show that our trio-based algorithm narrows down candidate coding genes and lncRNAs by about 10- and 5-fold, respectively, compared with single sequence-based approaches. Meanwhile, our algorithm is 10 times faster and two orders of magnitude more memory-frugal than the existing state-of-the-art approach. By applying our approach to a VSD trio, we identify an unreported gene (CD80), a combination of two genes (MYBPC3 and TRDN), and a lncRNA (NONHSAT096266.2), which are highly likely to be VSD-related.
Affiliation(s)
- Peng Jiang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Yaofei Hu
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Yiqi Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Jin Zhang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Qinghong Zhu
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Lin Bai
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Qiang Tong
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Tao Li
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Liang Zhao
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
9
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
Abstract
BACKGROUND: NGS data contain many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance. RESULTS: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to state-of-the-art methods when tested on both real and synthetic NGS data sets. CONCLUSION: The z-score is adequate to distinguish solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
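The z-score idea can be illustrated with a deliberately simple sketch. It assumes the z-score is taken against the mean and standard deviation of all k-mer counts in the dataset, and it uses a fixed cutoff; the abstract does not specify the reference population for the mean, and the paper's Gamma/Gaussian-mixture fit for f0 is not reproduced. The function names and the `z_cutoff` value are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_z_scores(counts):
    """z = (frequency - mean) / standard deviation, over all observed k-mers."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}


def split_solid_weak(counts, z_cutoff=-1.0):
    """Label k-mers with z >= z_cutoff as solid, the rest as weak (assumed rule)."""
    scores = kmer_z_scores(counts)
    solid = {kmer for kmer, z in scores.items() if z >= z_cutoff}
    weak = set(counts) - solid
    return solid, weak


if __name__ == "__main__":
    # Nine copies of a correct read plus one read with a substitution error.
    reads = ["ACGTAC"] * 9 + ["ACGTTC"]
    solid, weak = split_solid_weak(count_kmers(reads, k=4))
    print(sorted(solid), sorted(weak))
```

In the workflow the abstract describes, the resulting solid set would then be loaded into a Bloom filter that the corrector queries while scanning reads.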
Affiliation(s)
- Liang Zhao
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jin Xie
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Lin Bai
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wen Chen
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Mingju Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhonglei Zhang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Yiqi Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhe Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jinyan Li
  - Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
10
Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, Mayer G. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep 2018; 8:10950. [PMID: 30026539 PMCID: PMC6053417 DOI: 10.1038/s41598-018-29325-6] [Citation(s) in RCA: 169] [Impact Index Per Article: 28.2] [Received: 03/27/2018] [Accepted: 07/09/2018] [Indexed: 01/08/2023]
Abstract
Next-generation sequencing (NGS) is the method of choice when large numbers of sequences have to be obtained. While the technique is widely applied, varying error rates have been observed. We analysed millions of reads obtained after sequencing of one single sequence on an Illumina sequencer. According to our analysis, the index-PCR for sample preparation has no effect on the observed error rate, even though PCR is traditionally seen as one of the major contributors to enhanced error rates in NGS. In addition, we observed very persistent pre-phasing effects although the base calling software corrects for these. Removal of shortened sequences abolished these effects and allowed analysis of the actual mutations. The average error rate determined was 0.24 ± 0.06% per base and the percentage of mutated sequences was found to be 6.4 ± 1.24%. Constant regions at the 5' and 3' ends, e.g., primer binding sites used in in vitro selection procedures, seem to have no effect on mutation rates, and re-sequencing of samples yields very reproducible results. As phasing effects and other sequencing problems vary between equipment and individual setups, we recommend that all NGS users evaluate error rates and types to improve the quality and analysis of NGS data.
Affiliation(s)
- Franziska Pfeiffer
  - University of Bonn, LIMES Institute, Chemical Biology, Gerhard-Domagk-Str. 1, 53121, Bonn, Germany
- Carsten Gröber
  - AptaIT GmbH, Am Klopferspitz 19A, 82152, Planegg, Germany
- Michael Blank
  - AptaIT GmbH, Am Klopferspitz 19A, 82152, Planegg, Germany
- Kristian Händler
  - University of Bonn, LIMES Institute, Genomics and Immunoregulation, Carl-Troll-Str. 31, 53115, Bonn, Germany
  - German Center for Neurodegenerative Diseases (DZNE) and University of Bonn, Platform for Single Cell Genomics and Epigenomics, Sigmund-Freud-Str. 25, 53127, Bonn, Germany
- Marc Beyer
  - University of Bonn, LIMES Institute, Genomics and Immunoregulation, Carl-Troll-Str. 31, 53115, Bonn, Germany
  - German Center for Neurodegenerative Diseases (DZNE) and University of Bonn, Platform for Single Cell Genomics and Epigenomics, Sigmund-Freud-Str. 25, 53127, Bonn, Germany
  - DZNE, Molecular Immunology in Neurodegeneration, Sigmund-Freud-Str. 27, 53127, Bonn, Germany
- Joachim L Schultze
  - University of Bonn, LIMES Institute, Genomics and Immunoregulation, Carl-Troll-Str. 31, 53115, Bonn, Germany
  - German Center for Neurodegenerative Diseases (DZNE) and University of Bonn, Platform for Single Cell Genomics and Epigenomics, Sigmund-Freud-Str. 25, 53127, Bonn, Germany
- Günter Mayer
  - University of Bonn, LIMES Institute, Chemical Biology, Gerhard-Domagk-Str. 1, 53121, Bonn, Germany
  - Center of Aptamer Research and Development, Gerhard-Domagk-Str. 1, 53121, Bonn, Germany
11
Prosser C, Meyer W, Ellis J, Lee R. Evolutionary ARMS Race: Antimalarial Resistance Molecular Surveillance. Trends Parasitol 2018; 34:322-334. [PMID: 29396203 DOI: 10.1016/j.pt.2018.01.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/20/2017] [Revised: 01/02/2018] [Accepted: 01/03/2018] [Indexed: 01/13/2023]
Abstract
Molecular surveillance of antimalarial drug resistance markers has become an important part of resistance detection and containment. In the current climate of multidrug resistance, including resistance to the global front-line drug artemisinin, there is a consensus to upscale molecular surveillance. The most salient limitation to current surveillance efforts is that skill and infrastructure requirements preclude many regions. This includes sub-Saharan Africa, where Plasmodium falciparum is responsible for most of the global malaria disease burden. New molecular and data technologies have emerged with an emphasis on accessibility. These may allow surveillance to be conducted in broad settings where it is most needed, including at the primary healthcare level in endemic countries, and extending to the village health worker.
Affiliation(s)
- Christiane Prosser
  - Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Westmead Clinical School-Sydney Medical School, Marie Bashir Institute for Infectious Diseases and Biosecurity, University of Sydney, Sydney, NSW, Australia
  - Westmead Institute for Medical Research, Westmead, NSW, Australia
- Wieland Meyer
  - Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Westmead Clinical School-Sydney Medical School, Marie Bashir Institute for Infectious Diseases and Biosecurity, University of Sydney, Sydney, NSW, Australia
  - Westmead Institute for Medical Research, Westmead, NSW, Australia
- John Ellis
  - School of Life Sciences, University of Technology Sydney, NSW, Australia
- Rogan Lee
  - Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology & Medical Research, Westmead Hospital, Westmead, NSW, Australia