1
|
Al-Aamri A, Kamarul Azman S, Daw Elbait G, Alsafar H, Henschel A. Critical assessment of on-premise approaches to scalable genome analysis. BMC Bioinformatics 2023; 24:354. [PMID: 37735350 PMCID: PMC10512525 DOI: 10.1186/s12859-023-05470-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2023] [Accepted: 09/08/2023] [Indexed: 09/23/2023] Open
Abstract
BACKGROUND Plummeting DNA sequencing cost in recent years has enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS In order to efficiently leverage the wealth of information, we here assessed several genomic data science tools. The rationale to focus on on-premise installations is to cope with situations where data confidentiality and compliance regulations etc. rule out cloud based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS Tools that leverage sophisticated data structures are noted as the most suitable for large-scale projects in varying degrees of scalability in comparison to flat-file manipulation (e.g., BCFtools, and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database. CONCLUSION The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
Collapse
Affiliation(s)
- Amira Al-Aamri
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
| | - Syafiq Kamarul Azman
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
| | - Gihan Daw Elbait
- Department of Biology, College of Arts and Sciences, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
| | - Habiba Alsafar
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Department of Biomedical Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
| | - Andreas Henschel
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
| |
Collapse
|
2
|
Bouzinier MA, Etin D, Trifonov SI, Evdokimova VN, Ulitin V, Shen J, Kokorev A, Ghazani AA, Chekaluk Y, Albertyn Z, Giersch A, Morton CC, Abraamyan F, Bendapudi PK, Sunyaev S, Undiagnosed Diseases Network, Brigham Genomic Medicine, SEQuencing A Baby For An Optimal Outcome, Quantori, Krier JB. AnFiSA: An open-source computational platform for the analysis of sequencing data for rare genetic disease. J Biomed Inform 2022; 133:104174. [PMID: 35998814 DOI: 10.1016/j.jbi.2022.104174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2021] [Revised: 07/23/2022] [Accepted: 08/15/2022] [Indexed: 11/28/2022]
Abstract
Despite genomic sequencing rapidly transforming from being a bench-side tool to a routine procedure in a hospital, there is a noticeable lack of genomic analysis software that supports both clinical and research workflows as well as crowdsourcing. Furthermore, most existing software packages are not forward-compatible in regards to supporting ever-changing diagnostic rules adopted by the genetics community. Regular updates of genomics databases pose challenges for reproducible and traceable automated genetic diagnostics tools. Lastly, most of the software tools score low on explainability amongst clinicians. We have created a fully open-source variant curation tool, AnFiSA, with the intention to invite and accept contributions from clinicians, researchers, and professional software developers. The design of AnFiSA addresses the aforementioned issues via the following architectural principles: using a multidimensional database management system (DBMS) for genomic data to address reproducibility, curated decision trees adaptable to changing clinical rules, and a crowdsourcing-friendly interface to address difficult-to-diagnose cases. We discuss how we have chosen our technology stack and describe the design and implementation of the software. Finally, we show in detail how selected workflows can be implemented using the current version of AnFiSA by a medical geneticist.
Collapse
Affiliation(s)
- M A Bouzinier
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
| | - D Etin
- Forome Association, Boston, MA, USA; Oracle Corporation, USA.
| | | | - V N Evdokimova
- Forome Association, Boston, MA, USA; SBCS Scientific Biomedical Consulting Services, London, UK
| | - V Ulitin
- Forome Association, Boston, MA, USA
| | - J Shen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - A Kokorev
- ITMO University, St. Petersburg, Russian Federation
| | - A A Ghazani
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Harvard Medical School, Boston, MA, USA; Brigham Genomic Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Y Chekaluk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Z Albertyn
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - A Giersch
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - C C Morton
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Obstetrics and Gynecology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA; Manchester Centre for Audiology and Deafness (ManCAD), School of Health Sciences, University of Manchester, UK
| | - F Abraamyan
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - P K Bendapudi
- Division of Hemostasis and Thrombosis, Beth Israel Deaconess Medical Center, Boston, MA, USA; Division of Hematology and Blood Transfusion Service, Massachusetts General Hospital, Boston, MA, USA
| | - S Sunyaev
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
| | | | | | | | | | - J B Krier
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
| |
Collapse
|
3
|
Khazeeva G, Sablauskas K, van der Sanden B, Steyaert W, Kwint M, Rots D, Hinne M, van Gerven M, Yntema H, Vissers L, Gilissen C. DeNovoCNN: a deep learning approach to de novo variant calling in next generation sequencing data. Nucleic Acids Res 2022; 50:e97. [PMID: 35713566 PMCID: PMC9508836 DOI: 10.1093/nar/gkac511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2021] [Revised: 04/19/2022] [Accepted: 05/30/2022] [Indexed: 11/14/2022] Open
Abstract
De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller (DeNovoCNN), that encodes the alignment of sequence reads for a trio as 160\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{upgreek}
\usepackage{mathrsfs}
\setlength{\oddsidemargin}{-69pt}
\begin{document}
}{}$ \times$\end{document}164 resolution images. DeNovoCNN was trained on DNMs of 5616 whole exome sequencing (WES) trios achieving total 96.74% recall and 96.55% precision on the test dataset. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing DNM calling approaches (GATK, DeNovoGear, DeepTrio, Samtools) based on the Genome in a Bottle reference dataset and independent WES and WGS trios. Validations of DNMs based on Sanger and PacBio HiFi sequencing confirm that DeNovoCNN outperforms existing methods. Most importantly, our results suggest that DeNovoCNN is likely robust against different exome sequencing and analyses approaches, thereby allowing the application on other datasets. DeNovoCNN is freely available as a Docker container and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without a need for variant recalling.
Collapse
Affiliation(s)
- Gelana Khazeeva
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Karolis Sablauskas
- Institute of Clinical Medicine, Faculty of Medicine, Vilnius University, Vilnius, Lithuania
| | - Bart van der Sanden
- Department of Human Genetics, Donders Centre for Neuroscience, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Wouter Steyaert
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Michael Kwint
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Dmitrijs Rots
- Department of Human Genetics, Donders Centre for Neuroscience, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Max Hinne
- Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands
| | - Marcel van Gerven
- Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands
| | - Helger Yntema
- Department of Human Genetics, Donders Centre for Neuroscience, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Lisenka Vissers
- Department of Human Genetics, Donders Centre for Neuroscience, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA Nijmegen, The Netherlands
| |
Collapse
|
4
|
Abstract
While next-generation sequencing (NGS) has transformed genetic testing, it generates large quantities of noisy data that require a significant amount of bioinformatics to generate useful interpretation. The accuracy of variant calling is therefore critical. Although GATK HaplotypeCaller is a widely used tool for this purpose, newer methods such as DeepVariant have shown higher accuracy in assessments of gold-standard samples for whole-genome sequencing (WGS) and whole-exome sequencing (WES), but a side-by-side comparison on clinical samples has not been performed. Trio WES was used to compare GATK (4.1.2.0) HaplotypeCaller and DeepVariant (v0.8.0). The performance of the two pipelines was evaluated according to the Mendelian error rate, transition-to-transversion (Ti/Tv) ratio, concordance rate, and pathological variant detection rate. Data from 80 trios were analyzed. The Mendelian error rate of the 77 biological trios calculated from the data by DeepVariant (3.09 ± 0.83%) was lower than that calculated from the data by GATK (5.25 ± 0.91%) (p < 0.001). DeepVariant also yielded a higher Ti/Tv ratio (2.38 ± 0.02) than GATK (2.04 ± 0.07) (p < 0.001), suggesting that DeepVariant proportionally called more true positives. The concordance rate between the 2 pipelines was 88.73%. Sixty-three disease-causing variants were detected in the 80 trios. Among them, DeepVariant detected 62 variants, and GATK detected 61 variants. The one variant called by DeepVariant but not GATK HaplotypeCaller might have been missed by GATK HaplotypeCaller due to low coverage. OTC exon 2 (139 bp) deletion was not detected by either method. Mendelian error rate calculation is an effective way to evaluate variant callers. By this method, DeepVariant outperformed GATK, while the two pipelines performed equally in other parameters.
Collapse
|
5
|
Ahmed Z, Renart EG, Mishra D, Zeeshan S. JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene-variant discovery, annotation, prediction, and genotyping. FEBS Open Bio 2021; 11:2441-2452. [PMID: 34370400 PMCID: PMC8409305 DOI: 10.1002/2211-5463.13261] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 01/07/2023] Open
Abstract
Whole genome and exome sequencing (WGS/WES) are the most popular next‐generation sequencing (NGS) methodologies and are at present often used to detect rare and common genetic variants of clinical significance. We emphasize that automated sequence data processing, management, and visualization should be an indispensable component of modern WGS and WES data analysis for sequence assembly, variant detection (SNPs, SVs), imputation, and resolution of haplotypes. In this manuscript, we present a newly developed findable, accessible, interoperable, and reusable (FAIR) bioinformatics‐genomics pipeline Java based Whole Genome/Exome Sequence Data Processing Pipeline (JWES) for efficient variant discovery and interpretation, and big data modeling and visualization. JWES is a cross‐platform, user‐friendly, product line application, that entails three modules: (a) data processing, (b) storage, and (c) visualization. The data processing module performs a series of different tasks for variant calling, the data storage module efficiently manages high‐volume gene‐variant data, and the data visualization module supports variant data interpretation with Circos graphs. The performance of JWES was tested and validated in‐house with different experiments, using Microsoft Windows, macOS Big Sur, and UNIX operating systems. JWES is an open‐source and freely available pipeline, allowing scientists to take full advantage of all the computing resources available, without requiring much computer science knowledge. We have successfully applied JWES for processing, management, and gene‐variant discovery, annotation, prediction, and genotyping of WGS and WES data to analyze variable complex disorders. In summary, we report the performance of JWES with some reproducible case studies, using open access and in‐house generated, high‐quality datasets.
Collapse
Affiliation(s)
- Zeeshan Ahmed
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA.,Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, New Brunswick, NJ, USA
| | - Eduard Gibert Renart
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| | - Deepshikha Mishra
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| | - Saman Zeeshan
- Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| |
Collapse
|
6
|
Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021; 9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022] Open
Abstract
Over the last few decades, genomics is leading toward audacious future, and has been changing our views about conducting biomedical research, studying diseases, and understanding diversity in our society across the human species. The whole genome and exome sequencing (WGS/WES) are two of the most popular next-generation sequencing (NGS) methodologies that are currently being used to detect genetic variations of clinical significance. Investigating WGS/WES data for the variant discovery and genotyping is based on the nexus of different data analytic applications. Although several bioinformatics applications have been developed, and many of those are freely available and published. Timely finding and interpreting genetic variants are still challenging tasks among diagnostic laboratories and clinicians. In this study, we are interested in understanding, evaluating, and reporting the current state of solutions available to process the NGS data of variable lengths and types for the identification of variants, alleles, and haplotypes. Residing within the scope, we consulted high quality peer reviewed literature published in last 10 years. We were focused on the standalone and networked bioinformatics applications proposed to efficiently process WGS and WES data, and support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We have discussed our findings in this manuscript, which include but not are limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms and limitations of the assessed applications.
Collapse
Affiliation(s)
- Zeeshan Ahmed
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA.,Department of Medicine, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| | - Eduard Gibert Renart
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| | - Saman Zeeshan
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
| |
Collapse
|
7
|
Guo M, Li S, Zhou Y, Li M, Wen Z. Comparative Analysis for the Performance of Long-Read-Based Structural Variation Detection Pipelines in Tandem Repeat Regions. Front Pharmacol 2021; 12:658072. [PMID: 34163355 PMCID: PMC8215501 DOI: 10.3389/fphar.2021.658072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 05/14/2021] [Indexed: 12/04/2022] Open
Abstract
There has been growing recognition of the vital links between structural variations (SVs) and diverse diseases. Research suggests that, with much longer DNA fragments and abundant contextual information, long-read technologies have advantages in SV detection even in complex repetitive regions. So far, several pipelines for calling SVs from long-read sequencing data have been proposed and used in human genome research. However, the performance of these pipelines is still lack of deep exploration and adequate comparison. In this study, we comprehensively evaluated the performance of three commonly used long-read SV detection pipelines, namely PBSV, Sniffles and PBHoney, especially the performance on detecting the SVs in tandem repeat regions (TRRs). Evaluated by using a robust benchmark for germline SV detection as the gold standard, we thoroughly estimated the precision, recall and F1 score of insertions and deletions detected by the pipelines. Our results revealed that all these pipelines clearly exhibited better performance outside TRRs than that in TRRs. The F1 scores of Sniffles in and outside TRRs were 0.60 and 0.76, respectively. The performance of PBSV was similar to that of Sniffles, and was generally higher than that of PBHoney. In conclusion, our findings can be benefit for choosing the appropriate pipelines in real practice and are good complementary to the application of long-read sequencing technologies in the research of rare diseases.
Collapse
Affiliation(s)
- Mingkun Guo
- College of Chemistry, Sichuan University, Chengdu, China
| | - Shihai Li
- College of Chemistry, Sichuan University, Chengdu, China
| | - Yifan Zhou
- College of Chemistry, Sichuan University, Chengdu, China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu, China
| | - Zhining Wen
- College of Chemistry, Sichuan University, Chengdu, China.,Medical Big Data Center, Sichuan University, Chengdu, China
| |
Collapse
|
8
|
Kanzi AM, San JE, Chimukangara B, Wilkinson E, Fish M, Ramsuran V, de Oliveira T. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Front Genet 2020; 11:544162. [PMID: 33193618 PMCID: PMC7649788 DOI: 10.3389/fgene.2020.544162] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2020] [Accepted: 09/21/2020] [Indexed: 12/29/2022] Open
Abstract
Mendelian and complex genetic trait diseases continue to burden and affect society both socially and economically. The lack of effective tests has hampered diagnosis thus, the affected lack proper prognosis. Mendelian diseases are caused by genetic mutations in a singular gene while complex trait diseases are caused by the accumulation of mutations in either linked or unlinked genomic regions. Significant advances have been made in identifying novel diseases associated mutations especially with the introduction of next generation and third generation sequencing. Regardless, some diseases are still without diagnosis as most tests rely on SNP genotyping panels developed from population based genetic analyses. Analysis of family genetic inheritance using whole genomes, whole exomes or a panel of genes has been shown to be effective in identifying disease-causing mutations. In this review, we discuss next generation and third generation sequencing platforms, bioinformatic tools and genetic resources commonly used to analyze family based genomic data with a focus on identifying inherited or novel disease-causing mutations. Additionally, we also highlight the analytical, ethical and regulatory challenges associated with analyzing personal genomes which constitute the data used for family genetic inheritance.
Collapse
Affiliation(s)
- Aquillah M. Kanzi
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
| | | | | | | | | | | | | |
Collapse
|