1
Wang T, Zhang Y, Wang H, Zheng Q, Yang J, Zhang T, Sun G, Liu W, Yin L, He X, You R, Wang C, Liu Z, Liu Z, Wang J, Jin X, He Z. Fast and accurate DNASeq variant calling workflow composed of LUSH toolkit. Hum Genomics 2024; 18:114. [PMID: 39390620 PMCID: PMC11465951 DOI: 10.1186/s40246-024-00666-w] [Received: 11/19/2023] [Accepted: 08/22/2024] [Indexed: 10/12/2024] [Open Access]
Abstract
BACKGROUND: Whole genome sequencing (WGS) is becoming increasingly prevalent for molecular diagnosis, staging and prognosis because of its declining costs and the ability to detect nearly all genes associated with a patient's disease. The currently widely accepted variant calling pipeline, GATK, is limited in terms of its computational speed and efficiency, which cannot meet the growing analysis needs.

RESULTS: Here, we propose a fast and accurate DNASeq variant calling workflow that is purely composed of tools from LUSH toolkit. The precision and recall measurements indicate that both the LUSH and GATK pipelines exhibit high levels of consistency, with precision and recall rates exceeding 99% on the 30x NA12878 dataset. In terms of processing speed, the LUSH pipeline outperforms the GATK pipeline, completing 30x WGS data analysis in just 1.6 h, which is approximately 17 times faster than GATK. Notably, the LUSH_HC tool completes the processing from BAM to VCF in just 12 min, which is around 76 times faster than GATK.

CONCLUSION: These findings suggest that the LUSH pipeline is a highly promising alternative to the GATK pipeline for WGS data analysis, with the potential to significantly improve bedside analysis of acutely ill patients, large-scale cohort data analysis, and high-throughput variant calling in crop breeding programs. Furthermore, the LUSH pipeline is highly scalable and easily deployable, allowing it to be readily applied to various scenarios such as clinical diagnosis and genomic research.
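As a rough illustration of the metrics quoted in the abstract above, the following Python sketch shows how precision and recall are conventionally computed from true-positive, false-positive and false-negative call counts, and how a quoted speedup factor translates into an implied baseline runtime. The function names and the variant counts are hypothetical placeholders, not taken from the paper, and the implied GATK runtimes are back-of-envelope figures derived from the rounded numbers in the abstract, not reported measurements.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("need at least one called variant and one true variant")
    return tp / (tp + fp), tp / (tp + fn)


def implied_baseline(fast_runtime: float, speedup: float) -> float:
    """Baseline runtime implied by a runtime and an 'N times faster' factor."""
    return fast_runtime * speedup


# Hypothetical counts for illustration only (not from the paper).
precision, recall = precision_recall(tp=990, fp=5, fn=10)
print(f"precision={precision:.4f} recall={recall:.4f}")

# Figures quoted in the abstract: 1.6 h at ~17x, and 12 min at ~76x.
pipeline_gatk_hours = implied_baseline(1.6, 17)
hc_gatk_hours = implied_baseline(12, 76) / 60  # convert minutes to hours
print(f"implied GATK full-pipeline runtime: about {pipeline_gatk_hours:.1f} h")
print(f"implied GATK BAM-to-VCF runtime: about {hc_gatk_hours:.1f} h")
```

Because the speedup factors in the abstract are rounded ("approximately 17", "around 76"), the implied baselines should be read as order-of-magnitude estimates only.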
Affiliation(s)
- Taifu Wang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Youjin Zhang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Haoling Wang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Qiwen Zheng
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Jiaobo Yang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Tiefeng Zhang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Geng Sun
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Weicong Liu
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Longhui Yin
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Xinqiu He
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Rui You
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Chu Wang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Zhencheng Liu
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Zhijian Liu
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Jin'an Wang
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Xiangqian Jin
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
- Zengquan He
  - BGI Genomics, Shenzhen, 518083, China
  - Clin Lab, BGI Genomics, Shenzhen, 518083, China
2
Veerappa AM, Rowley MJ, Maggio A, Beaudry L, Hawkins D, Kim A, Sethi S, Sorgen PL, Guda C. CloudATAC: a cloud-based framework for ATAC-Seq data analysis. Brief Bioinform 2024; 25:bbae090. [PMID: 39041910 PMCID: PMC11264300 DOI: 10.1093/bib/bbae090] [Received: 11/28/2023] [Revised: 02/05/2024] [Accepted: 02/18/2024] [Indexed: 07/24/2024] [Open Access]
Abstract
Assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) generates genome-wide chromatin accessibility profiles, providing valuable insights into epigenetic gene regulation at both pooled-cell and single-cell population levels. Comprehensive analysis of ATAC-seq data involves the use of various interdependent programs. Learning the correct sequence of steps needed to process the data can represent a major hurdle. Selecting appropriate parameters at each stage, including pre-analysis, core analysis, and advanced downstream analysis, is important to ensure accurate analysis and interpretation of ATAC-seq data. Additionally, obtaining and working within a limited computational environment presents a significant challenge to non-bioinformatic researchers. Therefore, we present CloudATAC, an open-source, cloud-based interactive framework with a scalable, flexible, and streamlined analysis framework based on the best practices approach for pooled-cell and single-cell ATAC-seq data. These frameworks use on-demand computational power and memory, scalability, and a secure and compliant environment provided by the Google Cloud. Additionally, we leverage Jupyter Notebook's interactive computing platform that combines live code, tutorials, narrative text, flashcards, quizzes, and custom visualizations to enhance learning and analysis. Further, leveraging GPU instances has significantly improved the run-time of the single-cell framework. The source code and data are publicly available through NIH Cloud Lab (https://github.com/NIGMS/ATAC-Seq-and-Single-Cell-ATAC-Seq-Analysis). This manuscript describes the development of a resource module that is part of a learning platform named "NIGMS Sandbox for Cloud-based Learning" (https://github.com/NIGMS/NIGMS-Sandbox). The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox [1] at the beginning of this Supplement. This module delivers learning materials on the analysis of bulk and single-cell ATAC-seq data in an interactive format that uses appropriate cloud resources for data access and analyses.
Affiliation(s)
- Angela Maggio
  - Deloitte Consulting LLP, Health Data and AI, Arlington, VA, USA
- Laura Beaudry
  - Google, Google Public Sector, Professional Services, Reston, VA, USA
- Dale Hawkins
  - Google, Google Public Sector, Professional Services, Reston, VA, USA
- Allen Kim
  - Google, Google Public Sector, Professional Services, Reston, VA, USA
- Sahil Sethi
  - University of Nebraska Medical Center, Omaha, NE 68105, USA
- Paul L Sorgen
  - University of Nebraska Medical Center, Omaha, NE 68105, USA
3
Schmidt B, Hildebrandt A. From GPUs to AI and quantum: three waves of acceleration in bioinformatics. Drug Discov Today 2024; 29:103990. [PMID: 38663581 DOI: 10.1016/j.drudis.2024.103990] [Received: 12/13/2023] [Revised: 04/05/2024] [Accepted: 04/17/2024] [Indexed: 05/01/2024]
Abstract
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.
Affiliation(s)
- Bertil Schmidt
  - Institut für Informatik, Johannes Gutenberg University, Mainz, Germany
5
Cabello-Aguilar S, Vendrell JA, Solassol J. A Bioinformatics Toolkit for Next-Generation Sequencing in Clinical Oncology. Curr Issues Mol Biol 2023; 45:9737-9752. [PMID: 38132454 PMCID: PMC10741970 DOI: 10.3390/cimb45120608] [Received: 11/06/2023] [Revised: 11/28/2023] [Accepted: 12/02/2023] [Indexed: 12/23/2023] [Open Access]
Abstract
Next-generation sequencing (NGS) has taken on major importance in clinical oncology practice. With the advent of targeted therapies capable of effectively targeting specific genomic alterations in cancer patients, the development of bioinformatics processes has become crucial. Thus, bioinformatics pipelines play an essential role not only in the detection and identification of molecular alterations obtained from NGS data but also in the analysis and interpretation of variants, making it possible to transform raw sequencing data into meaningful and clinically useful information. In this review, we aim to examine the multiple steps of a bioinformatics pipeline as used in current clinical practice, and we also provide an updated list of the necessary bioinformatics tools. This resource is intended to assist researchers and clinicians in their genetic data analyses, improving the precision and efficiency of these processes in clinical research and patient care.
Affiliation(s)
- Simon Cabello-Aguilar
  - Montpellier BioInformatics for Clinical Diagnosis (MOBIDIC), Molecular Medicine and Genomics Platform (PMMG), CHU Montpellier, 34295 Montpellier, France
  - Laboratoire de Biologie des Tumeurs Solides, Département de Pathologie et Oncobiologie, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France
- Julie A. Vendrell
  - Laboratoire de Biologie des Tumeurs Solides, Département de Pathologie et Oncobiologie, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France
- Jérôme Solassol
  - Laboratoire de Biologie des Tumeurs Solides, Département de Pathologie et Oncobiologie, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France