1
|
Zhang Z, Liu Y, Li X, Liu Y, Wang Y, Jiang T. HapKled: a haplotype-aware structural variant calling approach for Oxford nanopore sequencing data. Front Genet 2024; 15:1435087. [PMID: 39045321 PMCID: PMC11263161 DOI: 10.3389/fgene.2024.1435087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2024] [Accepted: 06/13/2024] [Indexed: 07/25/2024] Open
Abstract
Introduction: Structural Variants (SVs) are a type of variation that can significantly influence phenotypes and cause diseases. Thus, the accurate detection of SVs is a vital part of modern genetic analysis. The advent of long-read sequencing technology ushers in a new era of more accurate and comprehensive SV calling, and many tools have been developed to call SVs using long-read data. Haplotype-tagging is a procedure that can tag haplotype information on reads and can thus potentially improve the SV detection; nevertheless, few methods make use of this information. In this article, we introduce HapKled, a new SV detection tool that can accurately detect SVs from Oxford Nanopore Technologies (ONT) long-read alignment data. Methods: HapKled utilizes haplotype information underlying alignment data by conducting haplotype-tagging using Whatshap on the reads to improve the detection performance, with three unique calling mechanics including altering clustering conditions according to haplotype information of signatures, determination of similar SVs based on haplotype information, and slack filtering conditions based on haplotype quality. Results: In our evaluations, HapKled outperformed state-of-the-art tools and can deliver better SV detection results on both simulated and real sequencing data. The code and experiments of HapKled can be obtained from https://github.com/CoREse/HapKled. Discussion: With the superb SV detection performance that HapKled can deliver, HapKled could be useful in bioinformatics research, clinical diagnosis, and medical research and development.
Collapse
Affiliation(s)
- Zhendong Zhang
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yue Liu
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Xin Li
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yadong Liu
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, China
| | - Yadong Wang
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, China
| | - Tao Jiang
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, China
| |
Collapse
|
2
|
Zhang Z, Jiang T, Li G, Cao S, Liu Y, Liu B, Wang Y. Kled: an ultra-fast and sensitive structural variant detection tool for long-read sequencing data. Brief Bioinform 2024; 25:bbae049. [PMID: 38385878 PMCID: PMC10883419 DOI: 10.1093/bib/bbae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 01/12/2024] [Accepted: 01/26/2024] [Indexed: 02/23/2024] Open
Abstract
Structural Variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data given the specially designed approach with a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.
Collapse
Affiliation(s)
- Zhendong Zhang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Tao Jiang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Gaoyang Li
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Shuqi Cao
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Yadong Liu
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Bo Liu
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Yadong Wang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| |
Collapse
|
3
|
Wiewiórka M, Szmurło A, Stankiewicz P, Gambin T. Cloud-native distributed genomic pileup operations. Bioinformatics 2022; 39:6900922. [PMID: 36515465 PMCID: PMC9848050 DOI: 10.1093/bioinformatics/btac804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2022] [Revised: 11/16/2022] [Accepted: 12/13/2022] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. RESULTS Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way. AVAILABILITY AND IMPLEMENTATION https://biodatageeks.github.io/sequila/.
Collapse
Affiliation(s)
| | | | - Paweł Stankiewicz
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | | |
Collapse
|
4
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
Collapse
|
5
|
Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]
|