Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017;33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open

For:	Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017;33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open

Number

Cited by Other Article(s)

Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022;21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open

Abstract

Next-Generation Sequencing has produced incredible amounts of short-reads sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both the de novo assembly and error correction methods utilize the overlaps between reads data, a concern is that the will the sequencing errors bring up negative effects on genome assemblies also affect the compression of the NGS data. This work addresses two problems: how current error correction algorithms can enable the compression algorithms to make the sequence data much more compact, and whether the sequence-modified reads by the error-correction algorithms will lead to quality improvement for de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in the reference-free compression, hence can greatly improve the compression performance. Extensive test on practical collections of multiple short-read sets does confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes furthermore. The error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, it is still an open question that how to combine appropriate error correction methods with an assembly algorithm so that the assembly performance can be always significantly improved.

Collapse

Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022;4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open

Abstract

Background

Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no entire genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common way prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data.

Results

In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in the big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data.

Conclusions

The excellent performance of BigFiRSt mainly resorts to the Big Data Hadoop technology to merge read pairs and mine SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.

Collapse

Affiliation(s)

Jinxiang Chen Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Fuyi Li Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
Miao Wang Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Junlong Li Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Tatiana T. Marquez-Lago Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
André Leier Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
Jerico Revote Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
Shuqin Li Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Quanzhong Liu Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China Quanzhong Liu
Jiangning Song Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia *Correspondence: Jiangning Song

Collapse

Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021;37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022] Open

GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments. BMC Bioinformatics 2021;22:220. [PMID: 33926379 PMCID: PMC8082839 DOI: 10.1186/s12859-021-04133-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Accepted: 04/14/2021] [Indexed: 11/10/2022] Open

Koessler T, Paradiso V, Piscuoglio S, Nienhold R, Ho L, Christinat Y, Terracciano LM, Cathomas G, Wicki A, McKee TA, Nouspikel T. Reliability of liquid biopsy analysis: an inter-laboratory comparison of circulating tumor DNA extraction and sequencing with different platforms. J Transl Med 2020;100:1475-1484. [PMID: 32616816 DOI: 10.1038/s41374-020-0459-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2020] [Revised: 06/19/2020] [Accepted: 06/19/2020] [Indexed: 01/11/2023] Open

Abstract

Liquid biopsy, the analysis of circulating tumor DNA (ctDNA), is a promising tool in oncology, especially in personalized medicine. Although its main applications currently focus on selection and adjustment of therapy, ctDNA may also be used to monitor residual disease, establish prognosis, detect relapses, and possibly screen at-risk individuals. CtDNA represents a small and variable proportion of circulating cell-free DNA (ccfDNA) which is itself present at a low concentration in normal individuals and so analyzing ctDNA is technically challenging. Various commercial systems have recently appeared on the market, but it remains difficult for practitioners to compare their performance and to determine whether they yield comparable results. As a first step toward establishing national guidelines for ctDNA analyses, four laboratories in Switzerland joined a comparative exercise to assess ccfDNA extraction and ctDNA analysis by sequencing. Extraction was performed using six distinct methods and yielded ccfDNA of equally high quality, suitable for sequencing. Sequencing of synthetic samples containing predefined amounts of eight mutations was performed on three different systems, with similar results. In all four laboratories, mutations were easily identified down to 1% allele frequency, whereas detection at 0.1% proved challenging. Linearity was excellent in all cases and while molecular yield was superior with one system this did not impact on sensitivity. This study also led to several additional conclusions: First, national guidelines should concentrate on principles of good laboratory practice rather than recommend a particular system. Second, it is essential that laboratories thoroughly validate every aspect of extraction and sequencing, in particular with respect to initial amount of DNA and average sequencing depth. Finally, as software proved critical for mutation detection, laboratories should validate the performance of variant callers and underlying algorithms with respect to various types of mutations.

Collapse

Jiang P, Luo J, Wang Y, Deng P, Schmidt B, Tang X, Chen N, Wong L, Zhao L. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2020;35:4871-4878. [PMID: 31038666 DOI: 10.1093/bioinformatics/btz299] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2018] [Revised: 04/02/2019] [Accepted: 04/19/2019] [Indexed: 12/25/2022] Open

Yoest JM, Shirai CL, Duncavage EJ. Sequencing-Based Measurable Residual Disease Testing in Acute Myeloid Leukemia. Front Cell Dev Biol 2020;8:249. [PMID: 32457898 PMCID: PMC7225302 DOI: 10.3389/fcell.2020.00249] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 03/24/2020] [Indexed: 12/31/2022] Open

Jiang P, Hu Y, Wang Y, Zhang J, Zhu Q, Bai L, Tong Q, Li T, Zhao L. Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study. Front Genet 2019;10:670. [PMID: 31440271 PMCID: PMC6694746 DOI: 10.3389/fgene.2019.00670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2018] [Accepted: 06/27/2019] [Indexed: 11/28/2022] Open

Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018;19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, Mayer G. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep 2018;8:10950. [PMID: 30026539 PMCID: PMC6053417 DOI: 10.1038/s41598-018-29325-6] [Citation(s) in RCA: 169] [Impact Index Per Article: 28.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2018] [Accepted: 07/09/2018] [Indexed: 01/08/2023] Open

Prosser C, Meyer W, Ellis J, Lee R. Evolutionary ARMS Race: Antimalarial Resistance Molecular Surveillance. Trends Parasitol 2018;34:322-334. [PMID: 29396203 DOI: 10.1016/j.pt.2018.01.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Revised: 01/02/2018] [Accepted: 01/03/2018] [Indexed: 01/13/2023]