1
Gong W, Pan X, Xu D, Ji G, Wang Y, Tian Y, Cai J, Li J, Zhang Z, Yuan X. Benchmarking DNA Methylation Analysis of 14 Alignment Algorithms for Whole Genome Bisulfite Sequencing in Mammals. Comput Struct Biotechnol J 2022; 20:4704-4716. [PMID: 36147684; PMCID: PMC9465269; DOI: 10.1016/j.csbj.2022.08.051] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2022] [Revised: 08/22/2022] [Accepted: 08/22/2022] [Indexed: 01/10/2023]
Abstract
Whole genome bisulfite sequencing (WGBS) is an essential technique for methylome studies. Although a series of tools have been developed to overcome the mapping challenges caused by bisulfite treatment, the latest available tools have not been evaluated for read-mapping performance or for their effect on biological insights across multiple mammals. Herein, based on real and simulated WGBS data comprising 14.77 billion reads, we undertook 936 mappings to benchmark and evaluate 14 widely used alignment algorithms, from read mapping to biological interpretation, in humans, cattle and pigs: Bwa-meth, BSBolt, BSMAP, Walt, Abismal, Batmeth2, Hisat_3n, Hisat_3n_repeat, Bismark-bwt2-e2e, Bismark-his2, BSSeeker2-bwt, BSSeeker2-soap2, BSSeeker2-bwt2-e2e and BSSeeker2-bwt2-local. Specifically, Bwa-meth, BSBolt, BSMAP, Bismark-bwt2-e2e and Walt exhibited more uniquely mapped reads and higher mapping precision, recall and F1 score than the other nine alignment algorithms, and the influence of the alignment algorithm on the methylomes varied considerably in the numbers and methylation levels of CpG sites and in the calling of differentially methylated CpGs (DMCs) and regions (DMRs). Moreover, BSMAP showed the highest accuracy in detecting CpG coordinates and methylation levels and in calling DMCs, DMRs, DMR-related genes and signaling pathways. These results suggest that algorithms for profiling genome-wide DNA methylation must be selected carefully, and our work provides investigators with useful information on the choice of alignment algorithm to improve the accuracy of DNA methylation detection in mammals.
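The abstract above ranks aligners by mapping precision, recall and F1 score but does not spell out the formulas. The following is a minimal Python sketch under the common convention for simulated reads, where a uniquely mapped read counts as correct if it lands at its simulated origin. The `MappingCounts` class, its field names and the example numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MappingCounts:
    """Tallies for one aligner on one simulated dataset (hypothetical layout)."""
    correct: int    # uniquely mapped reads placed at their simulated origin
    incorrect: int  # uniquely mapped reads placed somewhere else
    total: int      # all simulated reads, mapped or not


def mapping_scores(counts: MappingCounts) -> tuple:
    """Return (precision, recall, F1) under the assumed definitions."""
    mapped = counts.correct + counts.incorrect
    # Precision: share of uniquely mapped reads that are correctly placed.
    precision = counts.correct / mapped if mapped else 0.0
    # Recall: share of all simulated reads that are correctly placed.
    recall = counts.correct / counts.total if counts.total else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: 120 simulated reads, 100 uniquely mapped, 90 of them correct.
precision, recall, f1 = mapping_scores(MappingCounts(correct=90, incorrect=10, total=120))
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

An aligner that maps few reads but places them all correctly scores high precision and low recall, which is why F1 is often reported alongside both.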
Affiliation(s)
- Wentao Gong
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Xiangchun Pan
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Dantong Xu
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guanyu Ji
  - Shenzhen Gendo Health Technology Co., Ltd, Shenzhen 518122, China
- Yifei Wang
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Yuhan Tian
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Jiali Cai
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Jiaqi Li
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Zhe Zhang
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
  - Corresponding author.
- Xiaolong Yuan
  - Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
  - Corresponding author.
2
New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl 2021; 33:15669-15692. [PMID: 34155424; PMCID: PMC8208613; DOI: 10.1007/s00521-021-06188-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2020] [Accepted: 06/02/2021] [Indexed: 12/13/2022]
Abstract
During the last 15 years, improved omics sequencing technologies have expanded the scale and resolution of various biological applications, generating high-throughput datasets that require carefully chosen software tools to be processed. As sequencing has developed, bioinformatics researchers have therefore been challenged to implement alignment algorithms for next-generation sequencing reads. However, the selection of aligners based on genome characteristics remains poorly studied, so our benchmarking study extended the state of the art by comparing 17 different aligners. The chosen tools were assessed on empirical human DNA- and RNA-Seq data, as well as on simulated human and mouse datasets, evaluating a set of parameters not previously considered in this kind of benchmark. As expected, we found that each tool was the best under specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and BWA-MEM, which achieved the best results in terms of efficiency, accuracy, duplication rate, saturation profile and running time. For Illumina paired-end osteomyelitis transcriptomics data, by contrast, the best-performing algorithm, together with CLC, was Novoalign, which excelled in the accuracy and saturation analyses. Segemehl and DNASTAR performed best on both DNA-Seq datasets, with Segemehl particularly suitable for exome data. In conclusion, our study could guide users in selecting a suitable aligner based on genome and transcriptome characteristics. However, several other aspects that emerged from our work should be considered as alignment research evolves, such as the use of artificial intelligence to support cloud computing and mapping to multiple genomes.
3
Zhou Q, Lim JQ, Sung WK, Li G. An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping. BMC Bioinformatics 2019; 20:47. [PMID: 30669962; PMCID: PMC6343306; DOI: 10.1186/s12859-018-2593-4] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/23/2018] [Accepted: 12/27/2018] [Indexed: 02/03/2023]
Abstract
Background: DNA methylation plays crucial roles in most eukaryotic organisms. Bisulfite sequencing (BS-Seq) is a sequencing approach that provides quantitative cytosine methylation levels genome-wide at single-base resolution. However, genomic variations such as insertions and deletions (indels) affect methylation calling, and the alignment of reads near or across indels becomes inaccurate in the presence of polymorphisms. Hence, the simultaneous detection of DNA methylation and indels is important for exploring the mechanisms of functional regulation in organisms. Results: These problems motivated us to develop the algorithm BatMeth2, which can align BS reads with high accuracy while allowing for variable-length indels with respect to the reference genome. The results from simulated and real bisulfite DNA methylation data demonstrated that our proposed method increases alignment accuracy. Additionally, BatMeth2 can calculate the methylation levels of individual loci, genomic regions or functional regions such as genes/transposable elements. Additional programs were also developed to provide methylation data annotation, visualization, and differentially methylated cytosine/region (DMC/DMR) detection. The whole package provides new tools and will benefit bisulfite data analysis. Conclusion: BatMeth2 improves DNA methylation calling, particularly for regions close to indels. It is an autorun package and easy to use. In addition, a DNA methylation visualization program and a differential analysis program are provided in BatMeth2. We believe that BatMeth2 will facilitate the study of the mechanisms of DNA methylation in development and disease. BatMeth2 is an open source software program and is available on GitHub (https://github.com/GuoliangLi-HZAU/BatMeth2/).
Affiliation(s)
- Qiangwei Zhou
  - National Key Laboratory of Crop Genetic Improvement, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Jing-Quan Lim
  - Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore; Lymphoma Genomic Translational Research Laboratory, National Cancer Centre, Singapore, Singapore
- Wing-Kin Sung
  - Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore; Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672, Singapore; Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Guoliang Li
  - National Key Laboratory of Crop Genetic Improvement, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
4
Lee H, Lee KW, Lee T, Park D, Chung J, Lee C, Park WY, Son DS. Performance evaluation method for read mapping tool in clinical panel sequencing. Genes Genomics 2017; 40:189-197. [PMID: 29568413; PMCID: PMC5846869; DOI: 10.1007/s13258-017-0621-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/31/2017] [Accepted: 10/11/2017] [Indexed: 01/28/2023]
Abstract
Alongside the rapid advancement of Next-Generation Sequencing (NGS) technology, clinical panel sequencing is increasingly used in clinical studies and tests. However, the tools used in NGS data analysis have not been comparatively evaluated for their performance on panel sequencing. This study aimed to evaluate the tools used in alignment, the first step of bioinformatics analysis, by comparing widely used tools with recently introduced ones. Using accumulated panel sequencing data, detected variant lists were cataloged and inserted into simulated reads produced from the reference genome (hg19). The numbers of unmapped and misaligned reads, the mapping quality distribution, and runtime were measured as criteria for comparison. The most widely used tools, Bowtie2 and BWA-MEM, both performed strongly, with AUCs of 0.9984 and 0.9970, respectively. Kart, which had a superior runtime and fewer misaligned reads, also achieved a high AUC (0.9723). This method of selecting and optimizing tools for panel sequencing can be applied in fields that require error minimization, such as clinical applications and liquid biopsy studies.
Affiliation(s)
- Hojun Lee
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
- Ki-Wook Lee
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
- Taeseob Lee
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
- Donghyun Park
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
- Jongsuk Chung
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Molecular Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419 South Korea
- Chung Lee
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
- Woong-Yang Park
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Molecular Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419 South Korea; Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
- Dae-Soon Son
  - Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
5
Suitability of Different Mapping Algorithms for Genome-Wide Polymorphism Scans with Pool-Seq Data. G3: Genes, Genomes, Genetics 2016; 6:3507-3515. [PMID: 27613752; PMCID: PMC5100849; DOI: 10.1534/g3.116.034488] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 11/20/2022]
Abstract
The cost-effectiveness of sequencing pools of individuals (Pool-Seq) underlies the popularity and widespread use of this method for many research questions, ranging from unraveling the genetic basis of complex traits to the clonal evolution of cancer cells. Because the accuracy of Pool-Seq could be affected by many potential sources of error, several studies have determined, for example, the influence of sequencing technology, the library preparation protocol, and mapping parameters. Nevertheless, the impact of the mapping tools has not yet been evaluated. Using simulated and real Pool-Seq data, we demonstrate a substantial impact of the mapping tools, leading to characteristic false positives in genome-wide scans. The problem of false positives was particularly pronounced when data with different read lengths and insert sizes were compared. Out of 14 evaluated algorithms, novoalign, bwa mem and clc4 are the most suitable for mapping Pool-Seq data. Nevertheless, no single algorithm is sufficient for avoiding all false positives. We show that intersecting the results of two mapping algorithms provides a simple yet effective strategy to eliminate false positives. We propose that implementing a consistent Pool-Seq bioinformatics pipeline, building on the recommendations of this study, can substantially increase the reliability of Pool-Seq results, in particular when libraries generated with different protocols are being compared.
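The intersection strategy described in this abstract can be sketched in a few lines of Python. This is only an illustration of the idea: the `(chromosome, position)` record layout, the function name `consensus_sites` and the example coordinates are assumptions, and the study's actual pipeline and file formats are not described in the abstract.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (chromosome, 1-based position) of a candidate site.
Site = Tuple[str, int]


def consensus_sites(calls_a: Iterable[Site], calls_b: Iterable[Site]) -> List[Site]:
    """Keep only candidate sites found with both mappers' alignments."""
    # A site reported from only one mapper is treated as a likely
    # mapper-specific false positive and is dropped.
    return sorted(set(calls_a) & set(calls_b))


# Made-up candidates from the same reads aligned with two different mappers.
mapper_one_hits = [("2L", 1050), ("2L", 8800), ("3R", 420)]
mapper_two_hits = [("2L", 1050), ("3R", 420), ("X", 77)]

shared = consensus_sites(mapper_one_hits, mapper_two_hits)
print(shared)  # [('2L', 1050), ('3R', 420)]
```

The trade-off is that a true site missed by either mapper is also discarded, so the intersection buys specificity at some cost in sensitivity.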
6
Zheng Q, Grice EA. AlignerBoost: A Generalized Software Toolkit for Boosting Next-Gen Sequencing Mapping Accuracy Using a Bayesian-Based Mapping Quality Framework. PLoS Comput Biol 2016; 12:e1005096. [PMID: 27706155; PMCID: PMC5051939; DOI: 10.1371/journal.pcbi.1005096] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/21/2016] [Accepted: 08/02/2016] [Indexed: 01/09/2023]
Abstract
Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or "best" hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit "AlignerBoost", which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost’s algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost.
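To make the idea of a Bayesian mapping quality concrete, here is a generic Python sketch. It is not AlignerBoost's actual model, which the abstract does not specify. It assumes a uniform prior over candidate loci and a log-likelihood for each candidate alignment, then converts the posterior probability of the best hit into a Phred-scaled quality. The function name, the cap of 60 and the example values are all illustrative assumptions.

```python
import math
from typing import Sequence


def phred_mapq(log_likelihoods: Sequence[float], cap: int = 60) -> int:
    """Phred-scaled quality of the best hit among candidate alignments.

    Assumes a uniform prior over candidate loci, so the posterior of a hit
    is its likelihood divided by the sum of all candidate likelihoods.
    """
    if not log_likelihoods:
        raise ValueError("at least one candidate alignment is required")
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating for numerical stability;
    # the best hit then has weight exactly 1.
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    posterior_best = 1.0 / sum(weights)
    error = 1.0 - posterior_best
    if error <= 0.0:
        return cap  # a single candidate: no competing locus under this model
    return min(cap, round(-10.0 * math.log10(error)))


# Two equally good hits: posterior 0.5, so the quality is about 3.
print(phred_mapq([-5.0, -5.0]))
# A second hit that is 99 times less likely: posterior 0.99, quality 20.
print(phred_mapq([0.0, math.log(1 / 99)]))
```

Under this scheme, reads in repetitive regions receive low qualities because several loci explain them almost equally well, which is the situation the abstract highlights.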
Affiliation(s)
- Qi Zheng
  - Department of Dermatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - * E-mail: (QZ); (EAG)
- Elizabeth A. Grice
  - Department of Dermatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - * E-mail: (QZ); (EAG)
7
Liu B, Gao Y, Wang Y. LAMSA: fast split read alignment with long approximate matches. Bioinformatics 2016; 33:192-201. [DOI: 10.1093/bioinformatics/btw594] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 12/08/2015] [Revised: 07/20/2016] [Accepted: 09/08/2016] [Indexed: 12/20/2022]
8
Guan P, Sung WK. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods 2016; 102:36-49. [PMID: 26845461; DOI: 10.1016/j.ymeth.2016.01.020] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 10/18/2015] [Revised: 01/09/2016] [Accepted: 01/31/2016] [Indexed: 12/11/2022]
Abstract
Structural variations (SVs) are mutations in the genome at least fifty nucleotides in size. They contribute to phenotypic differences among healthy individuals and can cause severe diseases, including cancers, by breaking or linking genes. Thus, it is crucial to systematically profile SVs in the genome. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed, owing to the significant cost reduction of NGS experiments and their ability to detect SVs without bias at base-pair resolution. These SV detection methods vary in both sensitivity and specificity, since they use different SV-property-dependent and library-property-dependent features. As a result, predictions from different SV callers are often inconsistent. In addition, noise in the data (both platform-specific sequencing errors and artificial chimeric reads) reduces the specificity of SV detection. Poorly characterized regions in the human genome (e.g., repeat regions) greatly affect read mapping and in turn SV calling accuracy. Calling complex SVs requires specialized SV callers. Apart from accuracy, the processing speed of an SV caller is another factor that determines its usability. Knowing the pros and cons of different SV calling techniques and the objectives of the biological study is essential for biologists and bioinformaticians to make informed decisions. This paper describes the components of the SV calling pipeline and reviews the techniques used by existing SV callers. Through a simulation study, we also demonstrate that library properties, especially insert size, greatly affect the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies.
Affiliation(s)
- Peiyong Guan
  - School of Computing, National University of Singapore, 117543, Singapore
- Wing-Kin Sung
  - School of Computing, National University of Singapore, 117543, Singapore; Computational & Mathematical Biology Group, Genome Institute of Singapore, 138672, Singapore