1
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Nathan Dwarshuis
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Fritz J Sedlazeck
  - Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
2
Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Med 2020; 12:91. [PMID: 33106175 PMCID: PMC7586657 DOI: 10.1186/s13073-020-00791-w] [Citation(s) in RCA: 147] [Impact Index Per Article: 36.8] [Received: 10/30/2019] [Accepted: 10/08/2020] [Indexed: 02/08/2023] Open
Abstract
Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the “best practice” principles in this review should be relevant to clinical variant calling in the long term.
Affiliation(s)
- Daniel C Koboldt
  - Steve and Cindy Rasmussen Institute for Genomic Medicine at Nationwide Children's Hospital, Columbus, OH, USA
  - Department of Pediatrics, The Ohio State University, Columbus, OH, USA
3
Zhuang X, Ye R, So MT, Lam WY, Karim A, Yu M, Ngo ND, Cherny SS, Tam PKH, Garcia-Barcelo MM, Tang CSM, Sham PC. A random forest-based framework for genotyping and accuracy assessment of copy number variations. NAR Genom Bioinform 2020; 2:lqaa071. [PMID: 33575619 PMCID: PMC7671382 DOI: 10.1093/nargab/lqaa071] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/25/2020] [Revised: 08/18/2020] [Accepted: 08/26/2020] [Indexed: 12/24/2022] Open
Abstract
Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV2, particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV2 in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs.
Affiliation(s)
- Xuehan Zhuang
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Rui Ye
  - Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Man-Ting So
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Wai-Yee Lam
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Anwarul Karim
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Michelle Yu
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Ngoc Diem Ngo
  - National Hospital of Pediatrics, Ha Noi 100000, Vietnam
- Stacey S Cherny
  - Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Paul Kwong-Hang Tam
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Clara Sze-Man Tang
  - Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Pak Chung Sham
  - Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
4
A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020; 38:1347-1355. [PMID: 32541955 PMCID: PMC8454654 DOI: 10.1038/s41587-020-0538-8] [Citation(s) in RCA: 183] [Impact Index Per Article: 45.8] [Received: 07/16/2019] [Accepted: 04/28/2020] [Indexed: 12/19/2022]
Abstract
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed the first sequence-resolved benchmark set for identification of both false negative and false positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12745 isolated, sequence-resolved insertion (7281) and deletion (5464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5262 insertions and 4095 deletions supported by ≥1 diploid assembly. We demonstrate the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
5
Abstract
Next generation DNA sequencing (NGS) has the potential to improve the diagnostic and prognostic utility of newborn screening programmes. This study assesses the feasibility of automating NGS on dried blood spot (DBS) DNA in a United Kingdom National Health Service (UK NHS) laboratory. An NGS panel targeting the entire coding sequence of five genes relevant to disorders currently screened for in newborns in the UK was validated on DBS DNA. An automated process for DNA extraction, NGS and bioinformatics analysis was developed. The process was tested on DBS to determine feasibility, turnaround time and cost. The analytical sensitivity of the assay was 100% and analytical specificity was 99.96%, with a mean 99.5% concordance of variant calls between DBS and venous blood samples in regions with ≥30× coverage (96.8% across all regions; all variant calls were single nucleotide variants (SNVs), with indel performance not assessed). The pipeline enabled processing of up to 1000 samples a week with a turnaround time of four days from receipt of sample to reporting. This study concluded that it is feasible to automate targeted NGS on routine DBS samples in a UK NHS laboratory setting, but it may not currently be cost effective as a first line test.
6
Langley SA, Miga KH, Karpen GH, Langley CH. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife 2019; 8:e42989. [PMID: 31237235 PMCID: PMC6592686 DOI: 10.7554/elife.42989] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 10/19/2018] [Accepted: 05/20/2019] [Indexed: 12/22/2022] Open
Abstract
Despite critical roles in chromosome segregation and disease, the repetitive structure and vast size of centromeres and their surrounding heterochromatic regions impede studies of genomic variation. Here we report the identification of large-scale haplotypes (cenhaps) in humans that span the centromere-proximal regions of all metacentric chromosomes, including the arrays of highly repeated α-satellites on which centromeres form. Cenhaps reveal deep diversity, including entire introgressed Neanderthal centromeres and equally ancient lineages among Africans. These centromere-spanning haplotypes contain variants, including large differences in α-satellite DNA content, which may influence the fidelity and bias of chromosome transmission. The discovery of cenhaps creates new opportunities to investigate their contribution to phenotypic variation, especially in meiosis and mitosis, as well as to more incisively model the unexpectedly rich evolution of these challenging genomic regions.
Affiliation(s)
- Sasha A Langley
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
  - Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, United States
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, United States
- Gary H Karpen
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
  - Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, United States
- Charles H Langley
  - Department of Evolution and Ecology, University of California, Davis, Davis, United States
7
Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg L, Truty R, McLean CY, De La Vega FM, Xiao C, Sherry S, Salit M. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol 2019; 37:561-566. [PMID: 30936564 PMCID: PMC6500473 DOI: 10.1038/s41587-019-0074-6] [Citation(s) in RCA: 188] [Impact Index Per Article: 37.6] [Received: 05/25/2018] [Accepted: 02/19/2019] [Indexed: 12/30/2022]
Abstract
Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a 'first of its kind' resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.
Affiliation(s)
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Jennifer McDaniel
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Hemang Parikh
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Haynes Heaton
  - 10x Genomics, Pleasanton, CA, USA
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Len Trigg
  - Real Time Genomics, Hamilton, New Zealand
- Cory Y McLean
  - Verily Life Sciences, South San Francisco, CA, USA
  - Google Inc., Mountain View, CA, USA
- Francisco M De La Vega
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Chunlin Xiao
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Stephen Sherry
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Marc Salit
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
  - Joint Initiative for Metrology in Biology, Stanford, CA, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
8
Zhou B, Arthur JG, Ho SS, Pattni R, Huang Y, Wong WH, Urban AE. Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci Data 2018; 5:180261. [PMID: 30561434 PMCID: PMC6298255 DOI: 10.1038/sdata.2018.261] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/23/2018] [Accepted: 10/04/2018] [Indexed: 12/30/2022] Open
Abstract
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively-parallel DNA sequencing platform. The original Venter genome sequence is a very-high quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200 bp and 350 bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages respectively) as well as 2 kb, 5 kb, and 12 kb mate-pair libraries (49x, 122x, and 145x physical coverages respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
Affiliation(s)
- Bo Zhou
  - Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Joseph G. Arthur
  - Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
- Steve S. Ho
  - Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Reenal Pattni
  - Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Yiling Huang
  - Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Wing H. Wong
  - Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
- Alexander E. Urban
  - Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
  - Tashia and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Palo Alto, California 94305, USA