1. Couper LI, Dodge TO, Hemker JA, Kim BY, Exposito-Alonso M, Brem RB, Mordecai EA, Bitter MC. Evolutionary adaptation under climate change: Aedes sp. demonstrates potential to adapt to warming. Proc Natl Acad Sci U S A 2025; 122:e2418199122. [PMID: 39772738 PMCID: PMC11745351 DOI: 10.1073/pnas.2418199122] [Received: 09/06/2024] [Accepted: 12/16/2024] [Indexed: 01/11/2025]
Abstract
Climate warming is expected to shift the distributions of mosquitoes and mosquito-borne diseases, promoting expansions at cool range edges and contractions at warm range edges. However, whether mosquito populations could maintain their warm edges through evolutionary adaptation remains unknown. Here, we investigate the potential for thermal adaptation in Aedes sierrensis, a congener of the major disease vector species that experiences large thermal gradients in its native range, by assaying tolerance to prolonged and acute heat exposure, and its genetic basis in a diverse, field-derived population. We found pervasive evidence of heritable genetic variation in mosquito heat tolerance, and phenotypic trade-offs in tolerance to prolonged versus acute heat exposure. Further, we found genomic variation associated with prolonged heat tolerance was clustered in several regions of the genome, suggesting the presence of larger structural variants such as chromosomal inversions. A simple evolutionary model based on our data estimates that the maximum rate of evolutionary adaptation in mosquito heat tolerance will exceed the projected rate of climate warming, implying the potential for mosquitoes to track warming via genetic adaptation.
Affiliation(s)
- Lisa I. Couper
- Department of Biology, Stanford University, Stanford, CA 94305
- Division of Environmental Health Sciences, University of California, Berkeley, CA 94704
- James A. Hemker
- Department of Biology, Stanford University, Stanford, CA 94305
- Bernard Y. Kim
- Department of Biology, Stanford University, Stanford, CA 94305
- Moi Exposito-Alonso
- Department of Integrative Biology, University of California, Berkeley, CA 94704
- HHMI, Chevy Chase, MD 20815
- Rachel B. Brem
- Department of Plant & Microbial Biology, University of California, Berkeley, CA 94704
- Mark C. Bitter
- Department of Biology, Stanford University, Stanford, CA 94305

2. Billingsley KJ, Meredith M, Daida K, Jerez PA, Negi S, Malik L, Genner RM, Moller A, Zheng X, Gibson SB, Mastoras M, Baker B, Kouam C, Paquette K, Jarreau P, Makarious MB, Moore A, Hong S, Vitale D, Shah S, Monlong J, Pantazis CB, Asri M, Shafin K, Carnevali P, Marenco S, Auluck P, Mandal A, Miga KH, Rhie A, Reed X, Ding J, Cookson MR, Nalls M, Singleton A, Miller DE, Chaisson M, Timp W, Gibbs J, Phillippy AM, Kolmogorov M, Jain M, Sedlazeck FJ, Paten B, Blauwendraat C. Long-read sequencing of hundreds of diverse brains provides insight into the impact of structural variation on gene expression and DNA methylation. bioRxiv 2024:2024.12.16.628723. [PMID: 39764002 PMCID: PMC11702628 DOI: 10.1101/2024.12.16.628723] [Indexed: 01/18/2025]
Abstract
Structural variants (SVs) drive gene expression in the human brain and are causative of many neurological conditions. However, most existing genetic studies have been based on short-read sequencing methods, which capture fewer than half of the SVs present in any one individual. Long-read sequencing (LRS) enhances our ability to detect disease-associated and functionally relevant structural variants (SVs); however, its application in large-scale genomic studies has been limited by challenges in sample preparation and high costs. Here, we leverage a new scalable wet-lab protocol and computational pipeline for whole-genome Oxford Nanopore Technologies sequencing and apply it to neurologically normal control samples from the North American Brain Expression Consortium (NABEC) (European ancestry) and Human Brain Collection Core (HBCC) (African or African admixed ancestry) cohorts. Through this work, we present a publicly available long-read resource from 351 human brain samples (median N50: 27 Kbp and at an average depth of ~40x genome coverage). We discover approximately 234,905 SVs and produce locally phased assemblies that cover 95% of all protein-coding genes in GRCh38. Utilizing matched expression datasets for these samples, we apply quantitative trait locus (QTL) analyses and identify SVs that impact gene expression in post-mortem frontal cortex brain tissue. Further, we determine haplotype-specific methylation signatures at millions of CpGs and, with this data, identify cis-acting SVs. In summary, these results highlight that large-scale LRS can identify complex regulatory mechanisms in the brain that were inaccessible using previous approaches. We believe this new resource provides a critical step toward understanding the biological effects of genetic variation in the human brain.
Affiliation(s)
- Kimberley J. Billingsley
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Kensuke Daida
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Pilar Alvarez Jerez
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Neurodegenerative Disease, UCL Queen Square Institute of Neurology, University College London, London, UK
- Shloka Negi
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Laksh Malik
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Rylee M. Genner
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Abraham Moller
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Xinchang Zheng
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Sophia B. Gibson
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington, USA
- Mira Mastoras
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Breeana Baker
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Cedric Kouam
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Kimberly Paquette
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Paige Jarreau
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Mary B. Makarious
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- DataTecnica, Washington, DC, USA
- Anni Moore
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Samantha Hong
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Dan Vitale
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- DataTecnica, Washington, DC, USA
- Syed Shah
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- DataTecnica, Washington, DC, USA
- Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Caroline B. Pantazis
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Mobin Asri
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Paolo Carnevali
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Stefano Marenco
- Human Brain Collection Core, Division of Intramural Research, National Institute of Mental Health, NIH, Bethesda, MD, USA
- Pavan Auluck
- Human Brain Collection Core, Division of Intramural Research, National Institute of Mental Health, NIH, Bethesda, MD, USA
- Ajeet Mandal
- Human Brain Collection Core, Division of Intramural Research, National Institute of Mental Health, NIH, Bethesda, MD, USA
- Karen H. Miga
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Xylena Reed
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Jinhui Ding
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Mark R. Cookson
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Mike Nalls
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- DataTecnica, Washington, DC, USA
- Andrew Singleton
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Danny E. Miller
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, Seattle, WA, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA
- Mark Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- J. Raphael Gibbs
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Mikhail Kolmogorov
- Center for Cancer Research, National Cancer Institute, National Institutes of Health, USA
- Miten Jain
- Department of Bioengineering, Department of Physics, Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Fritz J. Sedlazeck
- Department of Molecular and Human Genetics, Baylor College of Medicine, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
- Cornelis Blauwendraat
- Center for Alzheimer’s and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA

3. Hiatt SM, Lawlor JMJ, Handley LH, Latner DR, Bonnstetter ZT, Finnila CR, Thompson ML, Boston LB, Williams M, Rodriguez Nunez I, Jenkins J, Kelley WV, Bebin EM, Lopez MA, Hurst ACE, Korf BR, Schmutz J, Grimwood J, Cooper GM. Long-read genome sequencing and variant reanalysis increase diagnostic yield in neurodevelopmental disorders. Genome Res 2024; 34:1747-1762. [PMID: 39299904 DOI: 10.1101/gr.279227.124] [Received: 02/29/2024] [Accepted: 08/19/2024] [Indexed: 09/22/2024]
Abstract
Variant detection from long-read genome sequencing (lrGS) has proven to be more accurate and comprehensive than variant detection from short-read genome sequencing (srGS). However, the rate at which lrGS can increase molecular diagnostic yield for rare disease is not yet precisely characterized. We performed lrGS using Pacific Biosciences "HiFi" technology on 96 short-read-negative probands with rare diseases that were suspected to be genetic. We generated hg38-aligned variants and de novo phased genome assemblies, and subsequently annotated, filtered, and curated variants using clinical standards. New disease-relevant or potentially relevant genetic findings were identified in 16/96 (16.7%) probands, eight of which (8/96, ∼8.3%) harbored pathogenic or likely pathogenic variants. Nine probands (∼9.4%) had variants that were accurately called in both srGS and lrGS and represent changes to clinical interpretation, mostly from recently published gene-disease associations. Seven cases included variants that were only correctly interpreted in lrGS, including copy-number variants (CNVs), an inversion, a mobile element insertion, two low-complexity repeat expansions, and a 1 bp deletion. While evidence for each of these variants is, in retrospect, visible in srGS, they were either not called within srGS data, were represented by calls with incorrect sizes or structures, or failed quality control and filtration. Thus, while reanalysis of older srGS data clearly increases diagnostic yield, we find that lrGS allows for substantial additional yield (7/96, 7.3%) beyond srGS. We anticipate that as lrGS analysis improves, and as lrGS data sets grow allowing for better variant-frequency annotation, the additional lrGS-only rare disease yield will grow over time.
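The percentages quoted in this abstract can be re-derived from the counts it states. The following is a minimal Python sketch for checking them; the dictionary labels and the `percent` helper are illustrative names chosen here, not taken from the paper.

```python
# Cohort size and counts as stated in the abstract (out of 96 probands).
TOTAL_PROBANDS = 96

counts = {
    "any new finding": 16,                               # 16/96
    "pathogenic or likely pathogenic": 8,                # from the fraction 8/96
    "reinterpreted, called in both srGS and lrGS": 9,    # "Nine probands"
    "correctly interpreted only in lrGS": 7,             # 7/96
}

def percent(count, total=TOTAL_PROBANDS):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

# Percentage yield for each category.
yields = {label: percent(n) for label, n in counts.items()}

for label, value in yields.items():
    print(f"{label}: {counts[label]}/{TOTAL_PROBANDS} = {value}%")
```

The reanalysis group (9) and the lrGS-only group (7) together account for the 16 probands with any new finding.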
Affiliation(s)
- Susan M Hiatt
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- James M J Lawlor
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Lori H Handley
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Donald R Latner
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Candice R Finnila
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Lori Beth Boston
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Melissa Williams
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Jerry Jenkins
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Whitley V Kelley
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- E Martina Bebin
- Department of Neurology, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Michael A Lopez
- Department of Neurology, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Department of Pediatrics, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Department of Genetics, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Anna C E Hurst
- Department of Genetics, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Bruce R Korf
- Department of Genetics, University of Alabama at Birmingham, Birmingham, Alabama 35924, USA
- Jeremy Schmutz
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Jane Grimwood
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Gregory M Cooper
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA

4. Gustafson JA, Gibson SB, Damaraju N, Zalusky MPG, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson ZB, Storz SHR, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E, McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation. Genome Res 2024; 34:2061-2073. [PMID: 39358015 DOI: 10.1101/gr.279273.124] [Received: 03/04/2024] [Accepted: 09/16/2024] [Indexed: 10/04/2024]
Abstract
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
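Read N50 and mean depth of coverage, the two summary statistics quoted in this abstract, have simple definitions. The following is a minimal Python sketch, assuming read lengths are available as a list of integers; the function names and the toy numbers are hypothetical and do not represent the consortium's pipeline.

```python
def read_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("read_n50 needs at least one read length")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length

def mean_depth(lengths, genome_size):
    """Average number of times each genome position is covered by sequenced bases."""
    return sum(lengths) / genome_size

# Toy data (hypothetical): five reads on a 10-base "genome".
toy_reads = [2, 3, 4, 5, 6]
print(read_n50(toy_reads))                    # 5
print(mean_depth(toy_reads, genome_size=10))  # 2.0
```

In the toy data the reads total 20 bases; the two longest reads (6 and 5) already cover 11 of them, so the N50 is 5.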
Affiliation(s)
- Jonas A Gustafson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Molecular and Cellular Biology Program, University of Washington, Seattle, Washington 98195, USA
- Sophia B Gibson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Nikhita Damaraju
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Institute for Public Health Genetics, University of Washington, Seattle, Washington 98195, USA
- Miranda P G Zalusky
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Kendra Hoekzema
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- David Twesigomwe
- Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg 2193, South Africa
- Lei Yang
- Pacific Northwest Research Institute, Seattle, Washington 98122, USA
- Anthony A Snead
- Department of Biology, New York University, New York, New York 10003, USA
- Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp 2650, Belgium
- Department of Biomedical Sciences, University of Antwerp, Antwerp 2000, Belgium
- Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
- Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Human Technopole, Milan 20157, Italy
- Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Angela L Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Joy Goffena
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Zachary B Anderson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Sophie H R Storz
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Sydney A Ward
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Maisha Sinha
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Claudia Gonzaga-Jauregui
- International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Mexico City 76230, Mexico
- Wayne E Clarke
- New York Genome Center, New York, New York 10013, USA
- Outlier Informatics Inc., Saskatoon, Saskatchewan S7H 1L4, Canada
- Anna O Basile
- New York Genome Center, New York, New York 10013, USA
- André Corvelo
- New York Genome Center, New York, New York 10013, USA
- Mahler Revsine
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Karynne E Patterson
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Cate R Paschal
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington 98195, USA
- Department of Laboratories, Seattle Children's Hospital, Seattle, Washington 98195, USA
- Christina Zakarian
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Sara Goodwin
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
- Tanner D Jensen
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Esther Robb
- Department of Computer Science, Stanford University, Stanford, California 94305, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
- Department of Computer Science, Rice University, Houston, Texas 77251, USA
- Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
- Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Mikhail Kolmogorov
- Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, Maryland 20892, USA
- Richard N McLaughlin
- Molecular and Cellular Biology Program, University of Washington, Seattle, Washington 98195, USA
- Pacific Northwest Research Institute, Seattle, Washington 98122, USA
- Harriet Dashnow
- Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
- Michael C Zody
- International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Mexico City 76230, Mexico
- Matt Loose
- Deep Seq, School of Life Sciences, University of Nottingham, Nottingham NG7 2TQ, UK
- Miten Jain
- Department of Bioengineering, Northeastern University, Boston, Massachusetts 02115, USA
- Department of Physics, Northeastern University, Boston, Massachusetts 02115, USA
- Khoury College of Computer Sciences, Northeastern University, Boston, Massachusetts 02115, USA
- Evan E Eichler
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Danny E Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington 98195, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, Washington 98195, USA

5. Couper LI, Dodge TO, Hemker JA, Kim BY, Exposito-Alonso M, Brem RB, Mordecai EA, Bitter MC. Evolutionary adaptation under climate change: Aedes sp. demonstrates potential to adapt to warming. bioRxiv 2024:2024.08.23.609454. [PMID: 39229052 PMCID: PMC11370604 DOI: 10.1101/2024.08.23.609454] [Indexed: 09/05/2024]
Abstract
Climate warming is expected to shift the distributions of mosquitoes and mosquito-borne diseases, facilitating expansions at cool range edges and contractions at warm range edges. However, whether mosquito populations could maintain their warm edges through evolutionary adaptation remains unknown. Here, we investigate the potential for thermal adaptation in Aedes sierrensis, a congener of the major disease vector species that experiences large thermal gradients in its native range, by assaying tolerance to prolonged and acute heat exposure, and its genetic basis in a diverse, field-derived population. We found pervasive evidence of heritable genetic variation in acute heat tolerance, which phenotypically trades off with tolerance to prolonged heat exposure. A simple evolutionary model based on our data shows that the estimated maximum rate of evolutionary adaptation in mosquito heat tolerance typically exceeds that of projected climate warming under idealized conditions. Our findings indicate that natural mosquito populations may have the potential to track projected warming via genetic adaptation. Prior climate-based projections may thus underestimate the range of mosquito and mosquito-borne disease distributions under future climate conditions.
Affiliation(s)
- Lisa I Couper
- Stanford University, Department of Biology
- University of California, Berkeley, Division of Environmental Health Sciences
- Moi Exposito-Alonso
- University of California, Berkeley, Department of Integrative Biology
- Howard Hughes Medical Institute
- Rachel B Brem
- University of California, Berkeley, Department of Plant & Microbial Biology

6. Zheng J, Li T, Ye H, Jiang Z, Jiang W, Yang H, Wu Z, Xie Z. Comprehensive identification of pathogenic variants in retinoblastoma by long- and short-read sequencing. Cancer Lett 2024; 598:217121. [PMID: 39009069 DOI: 10.1016/j.canlet.2024.217121] [Received: 10/18/2023] [Revised: 06/16/2024] [Accepted: 07/11/2024] [Indexed: 07/17/2024]
Abstract
Retinoblastoma (RB) is the most common intraocular malignancy in childhood. The causal variants in RB have mostly been characterized by short-read sequencing (SRS) analysis, which has technical limitations in identifying structural variants (SVs) and phasing information. Long-read sequencing (LRS) technology has advantages over SRS in detecting SVs, phased genetic variants, and methylation. In this study, we comprehensively characterized the genetic landscape of RB using combined LRS and SRS of 16 RB tumors and 16 matched blood samples. We detected a total of 232 somatic SVs, an average of 14.5 SVs per sample, across the whole genome in our cohort. We identified 20 distinct pathogenic variants disrupting the RB1 gene, including three novel small variants and five somatic SVs. More somatic SVs were detected by LRS than by SRS (140 vs. 122) in RB samples with WGS data, particularly insertions (18 vs. 1). Furthermore, our analysis shows that, with the exception of one sample that lacked methylation data, all samples presented biallelic inactivation of RB1 in various forms, including two cases with a biallelic hypermethylated promoter and four cases with compound heterozygous mutations that were missed in SRS analysis. By inferring the relative timing of somatic events, we reveal a genetic progression during RB tumorigenesis in which RB1 disruption occurs early and is followed by copy number changes, including amplifications of Chr2p and deletions of Chr16q. Altogether, we characterize the comprehensive genetic landscape of RB, providing novel insights into the genetic alterations and mechanisms contributing to RB initiation and development. Our work also establishes a framework for analyzing the genomic landscape of cancers based on LRS data.
Affiliation(s)
- Jingjing Zheng
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Tong Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Huijing Ye
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Zehang Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Wenbing Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
| | - Huasheng Yang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China.
| | - Zhikun Wu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China.
| | - Zhi Xie
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China.
| |
Collapse
|
7
Schreiber M, Jayakodi M, Stein N, Mascher M. Plant pangenomes for crop improvement, biodiversity and evolution. Nat Rev Genet 2024; 25:563-577. [PMID: 38378816 PMCID: PMC7616794 DOI: 10.1038/s41576-024-00691-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/14/2023] [Indexed: 02/22/2024]
Abstract
Plant genome sequences catalogue genes and the genetic elements that regulate their expression. Such inventories further research aims as diverse as mapping the molecular basis of trait diversity in domesticated plants or inquiries into the origin of evolutionary innovations in flowering plants millions of years ago. The transformative technological progress of DNA sequencing in the past two decades has enabled researchers to sequence ever more genomes with greater ease. Pangenomes - complete sequences of multiple individuals of a species or higher taxonomic unit - have now entered the geneticists' toolkit. The genomes of crop plants and their wild relatives are being studied with translational applications in breeding in mind. But pangenomes are applicable also in ecological and evolutionary studies, as they help classify and monitor biodiversity across the tree of life, deepen our understanding of how plant species diverged and show how plants adapt to changing environments or new selection pressures exerted by human beings.
Affiliation(s)
- Mona Schreiber
- Department of Biology, University of Marburg, Marburg, Germany
- Murukarthick Jayakodi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
8
Ormond C, Ryan NM, Byerley W, Heron EA, Corvin A. Investigating copy number variants in schizophrenia pedigrees using a new consensus pipeline called PECAN. Sci Rep 2024; 14:17518. [PMID: 39080331 PMCID: PMC11289470 DOI: 10.1038/s41598-024-66021-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Accepted: 06/26/2024] [Indexed: 08/02/2024] Open
Abstract
Copy number variants (CNVs) have been implicated in many human diseases, including psychiatric disorders. Whole genome sequencing offers advantages in CNV calling compared to previous array-based methods. Here we present a robust and transparent CNV calling pipeline, PECAN (PEdigree Copy number vAriaNt calling), for short-read, whole genome sequencing data, comprised of a novel combination of four calling methods and structural variant genotyping. This method is scalable and can incorporate pedigree information to retain lower-confidence CNVs that would otherwise be discarded. We have robustly benchmarked PECAN using gold-standard CNV calls for two well-established evaluation samples, NA12878 and HG002, showing that PECAN performs with high precision and recall on both datasets, outperforming another pedigree-based CNV calling pipeline. As part of this work, we provide a list of high-confidence gold standard CNVs for the NA12878 reference sample, curated from multiple studies. We applied PECAN to a collection of pedigrees multiply affected with schizophrenia and identified a rare deletion that perfectly co-segregates with schizophrenia in one of the pedigrees. The CNV overlaps the gene PITRM1, which has been implicated in a complex phenotype including ataxia, developmental delay, and schizophrenia-like episodes in affected adults.
Affiliation(s)
- Cathal Ormond
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- Niamh M Ryan
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- William Byerley
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, CA, USA
- Elizabeth A Heron
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
9
Samano A, Kumar N, Liao Y, Ishtiaq F, Chakraborty M. Genome structural variants shape adaptive success of an invasive urban malaria vector Anopheles stephensi. bioRxiv 2024:2024.07.29.605641. [PMID: 39211149 PMCID: PMC11360885 DOI: 10.1101/2024.07.29.605641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
Global changes are associated with the emergence of several invasive species. However, the genomic determinants of the adaptive success of an invasive species in a new environment remain poorly understood. Genomic structural variants (SVs), consisting of copy number variants, play an important role in adaptation. SVs often cause large adaptive shifts in ecologically important traits, which makes SVs compelling candidates for driving rapid adaptations to environmental changes, which is critical to invasive success. To address this problem, we investigated the role SVs play in the adaptive success of Anopheles stephensi, a primary vector of urban malaria in South Asia and an invasive malaria vector in several South Asian islands and Africa. We collected whole genome sequencing data from 115 mosquitoes from invasive island populations and four locations from mainland India, an ancestral range for the species. We identified 2,988 duplication copy number variants and 16,038 deletions in these strains, with ∼50% overlapping genes. SVs are enriched in genomic regions with signatures of selective sweeps in the mainland and invasive island populations, implying a putative adaptive role of SVs. Nearly all high-frequency SVs, including the candidate adaptive variants, in the invasive island populations are present on the mainland, suggesting a major contribution of existing variation to the success of the island populations. Among the candidate adaptive SVs, three duplications involving toxin-resistance genes evolved, likely due to the widespread application of insecticides in India since the 1950s. We also identify two SVs associated with the adaptation of An. stephensi larvae to brackish water in the island and two coastal mainland populations, where the mutations likely originated. Our results suggest that existing SVs play a vital role in the evolutionary success of An. stephensi in new environmental conditions.
10
Sarwal V, Lee S, Yang J, Sankararaman S, Chaisson M, Eskin E, Mangul S. VISTA: an integrated framework for structural variant discovery. Brief Bioinform 2024; 25:bbae462. [PMID: 39297879 PMCID: PMC11411772 DOI: 10.1093/bib/bbae462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 08/27/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] Open
Abstract
Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome. Still, this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn's disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Since identifying structural variants is an important problem in genetics, several specialized computational techniques have been developed to detect structural variants directly from sequencing data. With advances in whole-genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect a full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools which ignore the length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, along with an in-house polymerase chain reaction (PCR)-validated mouse gold standard set. VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized can attain 100% precision and the highest sensitivity among other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.
Affiliation(s)
- Varuni Sarwal
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Seungmo Lee
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Jianzhi Yang
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Sriram Sankararaman
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Mark Chaisson
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Serghei Mangul
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, United States
11
Jugas R, Vitkova H. ProcaryaSV: structural variation detection pipeline for bacterial genomes using short-read sequencing. BMC Bioinformatics 2024; 25:233. [PMID: 38982375 PMCID: PMC11234778 DOI: 10.1186/s12859-024-05843-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 06/13/2024] [Indexed: 07/11/2024] Open
Abstract
BACKGROUND: Structural variations play an important role in bacterial genomes. They can mediate genome adaptation quickly in response to the external environment and thus can also play a role in antibiotic resistance. The detection of structural variations in bacteria is challenging, and the recognition of even small rearrangements can be important. Even though most detection tools are aimed at and benchmarked on eukaryotic genomes, they can also be used on prokaryotic genomes. The key features of detection are the ability to detect small rearrangements and support haploid genomes. Because of the limiting performance of a single detection tool, combining the detection abilities of multiple tools can lead to more robust results. There are already available workflows for structural variation detection for long-reads technologies and for the detection of single-nucleotide variation and indels, both aimed at bacteria. Yet we are unaware of structural variations detection workflows for the short-reads sequencing platform. Motivated by this gap we created our workflow. Further, we were interested in increasing the detection performance and providing more robust results. RESULTS: We developed an open-source bioinformatics pipeline, ProcaryaSV, for the detection of structural variations in bacterial isolates from paired-end short sequencing reads. Multiple tools, starting with quality control and trimming of sequencing data, alignment to the reference genome, and multiple structural variation detection tools, are integrated. All the partial results are then processed and merged with an in-house merging algorithm. Compared with a single detection approach, ProcaryaSV has improved detection performance and is a reproducible easy-to-use tool. CONCLUSIONS: The ProcaryaSV pipeline provides an integrative approach to structural variation detection from paired-end next-generation sequencing of bacterial samples. It can be easily installed and used on Linux machines. It is publicly available on GitHub at https://github.com/robinjugas/ProcaryaSV.
Affiliation(s)
- Robin Jugas
- Department of Biomedical Engineering, Brno University of Technology, Brno, Czech Republic
- Helena Vitkova
- Department of Biomedical Engineering, Brno University of Technology, Brno, Czech Republic
12
Liang X, Duan Q, Li B, Wang Y, Bu Y, Zhang Y, Kuang Z, Mao L, An X, Wang H, Yang X, Wan N, Feng Z, Shen W, Miao W, Chen J, Liu S, Storz JF, Liu J, Nevo E, Li K. Genomic structural variation contributes to evolved changes in gene expression in high-altitude Tibetan sheep. Proc Natl Acad Sci U S A 2024; 121:e2322291121. [PMID: 38913905 PMCID: PMC11228492 DOI: 10.1073/pnas.2322291121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 05/06/2024] [Indexed: 06/26/2024] Open
Abstract
Tibetan sheep were introduced to the Qinghai Tibet plateau roughly 3,000 B.P., making this species a good model for investigating genetic mechanisms of high-altitude adaptation over a relatively short timescale. Here, we characterize genomic structural variants (SVs) that distinguish Tibetan sheep from closely related, low-altitude Hu sheep, and we examine associated changes in tissue-specific gene expression. We document differentiation between the two sheep breeds in frequencies of SVs associated with genes involved in cardiac function and circulation. In Tibetan sheep, we identified high-frequency SVs in a total of 462 genes, including EPAS1, PAPSS2, and PTPRD. Single-cell RNA-Seq data and luciferase reporter assays revealed that the SVs had cis-acting effects on the expression levels of these three genes in specific tissues and cell types. In Tibetan sheep, we identified a high-frequency chromosomal inversion that exhibited modified chromatin architectures relative to the noninverted allele that predominates in Hu sheep. The inversion harbors several genes with altered expression patterns related to heart protection, brown adipocyte proliferation, angiogenesis, and DNA repair. These findings indicate that SVs represent an important source of genetic variation in gene expression and may have contributed to high-altitude adaptation in Tibetan sheep.
Affiliation(s)
- Xiaolong Liang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Qijiao Duan
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Bowen Li
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Yinjia Wang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Yueting Bu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Yonglu Zhang
- Fengjia Town Health Center, Rushan City, Weihai City 264200, China
- Zhuoran Kuang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Leyan Mao
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Xuan An
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Huihua Wang
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Xiaojie Yang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Na Wan
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Zhilong Feng
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Wei Shen
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Weilan Miao
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Jiaqi Chen
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Sanyuan Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Jay F. Storz
- School of Biological Sciences, University of Nebraska, Lincoln, NE 68588
- Jianquan Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
- Eviatar Nevo
- Institute of Evolution, University of Haifa, Haifa 3498838, Israel
- Kexin Li
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou 730000, China
13
Kramer M, Goodwin S, Wappel R, Borio M, Offit K, Feldman DR, Stadler ZK, McCombie WR. Exploring the genetic and epigenetic underpinnings of early-onset cancers: Variant prioritization for long read whole genome sequencing from family cancer pedigrees. bioRxiv 2024:2024.06.27.601096. [PMID: 39005350 PMCID: PMC11244929 DOI: 10.1101/2024.06.27.601096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Despite significant advances in our understanding of genetic cancer susceptibility, known inherited cancer predisposition syndromes explain at most 20% of early-onset cancers. As early-onset cancer prevalence continues to increase, the need to assess previously inaccessible areas of the human genome, harnessing a trio or quad family-based architecture for variant filtration, may reveal further insights into cancer susceptibility. To assess a broader spectrum of variation than can be ascertained by multi-gene panel sequencing, or even whole genome sequencing with short reads, we employed long read whole genome sequencing using an Oxford Nanopore Technology (ONT) PromethION of 3 families containing an early-onset cancer proband using a trio or quad family architecture. Analysis included 2 early-onset colorectal cancer family trios and one quad consisting of two siblings with testicular cancer, all with unaffected parents. Structural variants (SVs), epigenetic profiles and single nucleotide variants (SNVs) were determined for each individual, and a filtering strategy was employed to refine and prioritize candidate variants based on the family architecture. The family architecture enabled us to focus on inapposite variants while filtering variants shared with the unaffected parents, significantly decreasing background variation that can hamper identification of potentially disease causing differences. Candidate de novo and compound heterozygous variants were identified in this way. Gene expression, in matched neoplastic and pre-neoplastic lesions, was assessed for one trio. Our study demonstrates the feasibility of a streamlined analysis of genomic variants from long read ONT whole genome sequencing and a way to prioritize key variants for further evaluation of pathogenicity, while revealing what may be missing from panel based analyses.
14
Zhang Y, Jin J, Wang N, Sun Q, Feng D, Zhu S, Wang Z, Li S, Ye J, Chai L, Xie Z, Deng X. Cytochrome P450 CitCYP97B modulates carotenoid accumulation diversity by hydroxylating β-cryptoxanthin in Citrus. Plant Commun 2024; 5:100847. [PMID: 38379285 PMCID: PMC11211522 DOI: 10.1016/j.xplc.2024.100847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 12/21/2023] [Accepted: 02/18/2024] [Indexed: 02/22/2024]
Abstract
Carotenoids in plant foods provide health benefits by functioning as provitamin A. One of the vital provitamin A carotenoids, β-cryptoxanthin, is typically plentiful in citrus fruit. However, little is known about the genetic basis of β-cryptoxanthin accumulation in citrus. Here, we performed a widely targeted metabolomic analysis of 65 major carotenoids and carotenoid derivatives to characterize carotenoid accumulation in Citrus and determine the taxonomic profile of β-cryptoxanthin. We used data from 81 newly sequenced representative accessions and 69 previously sequenced Citrus cultivars to reveal the genetic basis of β-cryptoxanthin accumulation through a genome-wide association study. We identified a causal gene, CitCYP97B, which encodes a cytochrome P450 protein whose substrate and metabolic pathways in land plants were undetermined. We subsequently demonstrated that CitCYP97B functions as a novel monooxygenase that specifically hydroxylates the β-ring of β-cryptoxanthin in a heterologous expression system. In planta experiments provided further evidence that CitCYP97B negatively regulates β-cryptoxanthin content. Using the sequenced Citrus accessions, we found that two critical structural cis-element variations contribute to increased expression of CitCYP97B, thereby altering β-cryptoxanthin accumulation in fruit. Hybridization/introgression appear to have contributed to the prevalence of two cis-element variations in different Citrus types during citrus evolution. Overall, these findings extend our understanding of the regulation and diversity of carotenoid metabolism in fruit crops and provide a genetic target for production of β-cryptoxanthin-biofortified products.
Affiliation(s)
- Yingzi Zhang
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Jiajing Jin
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Nan Wang
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Quan Sun
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Di Feng
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Shenchao Zhu
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Zexin Wang
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Shunxin Li
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Junli Ye
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Lijun Chai
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Zongzhou Xie
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Xiuxin Deng
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China
15
Steyaert W, Sagath L, Demidov G, Yépez VA, Esteve-Codina A, Gagneur J, Ellwanger K, Derks R, Weiss M, den Ouden A, van den Heuvel S, Swinkels H, Zomer N, Steehouwer M, O'Gorman L, Astuti G, Neveling K, Schüle R, Xu J, Synofzik M, Beijer D, Hengel H, Schöls L, Claeys KG, Baets J, Van de Vondel L, Ferlini A, Selvatici R, Morsy H, Saeed Abd Elmaksoud M, Straub V, Müller J, Pini V, Perry L, Sarkozy A, Zaharieva I, Muntoni F, Bugiardini E, Polavarapu K, Horvath R, Reid E, Lochmüller H, Spinazzi M, Savarese M, Matalonga L, Laurie S, Brunner HG, Graessner H, Beltran S, Ossowski S, Vissers LELM, Gilissen C, Hoischen A. Unravelling undiagnosed rare disease cases by HiFi long-read genome sequencing. medRxiv 2024:2024.05.03.24305331. [PMID: 38746462 PMCID: PMC11092722 DOI: 10.1101/2024.05.03.24305331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Solve-RD is a pan-European rare disease (RD) research program that aims to identify disease-causing genetic variants in previously undiagnosed RD families. We utilised 10-fold coverage HiFi long-read sequencing (LRS) for detecting causative structural variants (SVs), single nucleotide variants (SNVs), insertion-deletions (InDels), and short tandem repeat (STR) expansions in extensively studied RD families without clear molecular diagnoses. Our cohort includes 293 individuals from 114 genetically undiagnosed RD families selected by European Rare Disease Network (ERN) experts. Of these, 21 families were affected by so-called 'unsolvable' syndromes for which genetic causes remain unknown, and 93 families with at least one individual affected by a rare neurological, neuromuscular, or epilepsy disorder without genetic diagnosis despite extensive prior testing. Clinical interpretation and orthogonal validation of variants in known disease genes yielded thirteen novel genetic diagnoses due to de novo and rare inherited SNVs, InDels, SVs, and STR expansions. In an additional four families, we identified a candidate disease-causing SV affecting several genes including an MCF2/FGF13 fusion and PSMA3 deletion. However, no common genetic cause was identified in any of the 'unsolvable' syndromes. Taken together, we found (likely) disease-causing genetic variants in 13.0% of previously unsolved families and additional candidate disease-causing SVs in another 4.3% of these families. In conclusion, our results demonstrate the added value of HiFi long-read genome sequencing in undiagnosed rare diseases.
16
Tsouris A, Brach G, Friedrich A, Hou J, Schacherer J. Diallel panel reveals a significant impact of low-frequency genetic variants on gene expression variation in yeast. Mol Syst Biol 2024; 20:362-373. [PMID: 38355920 PMCID: PMC10987670 DOI: 10.1038/s44320-024-00021-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 01/29/2024] [Accepted: 01/30/2024] [Indexed: 02/16/2024] Open
Abstract
Unraveling the genetic sources of gene expression variation is essential to better understand the origins of phenotypic diversity in natural populations. Genome-wide association studies have identified thousands of variants involved in gene expression variation; however, the variants detected explain only part of the heritability. In fact, variants such as low-frequency and structural variants (SVs) are poorly captured in association studies. To assess the impact of these variants on gene expression variation, we explored a half-diallel panel composed of 323 hybrids originating from pairwise crosses of 26 natural Saccharomyces cerevisiae isolates. Using short- and long-read sequencing strategies, we established an exhaustive catalog of single nucleotide polymorphisms (SNPs) and SVs for this panel. Combining this dataset with the transcriptomes of all hybrids, we comprehensively mapped SNPs and SVs associated with gene expression variation. While SVs impact gene expression variation, SNPs exhibit a higher effect size with an overrepresentation of low-frequency variants compared to common ones. These results reinforce the importance of dissecting the heritability of complex traits with a comprehensive catalog of genetic variants at the population level.
Affiliation(s)
- Andreas Tsouris
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Gauthier Brach
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Anne Friedrich
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Jing Hou
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Joseph Schacherer
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
  - Institut Universitaire de France (IUF), Paris, France
17
Jensen TD, Ni B, Reuter CM, Gorzynski JE, Fazal S, Bonner D, Ungar RA, Goddard PC, Raja A, Ashley EA, Bernstein JA, Zuchner S, Greicius MD, Montgomery SB, Schatz MC, Wheeler MT, Battle A. Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease. medRxiv 2024:2024.03.22.24304565. [PMID: 38585781 PMCID: PMC10996727 DOI: 10.1101/2024.03.22.24304565] [Indexed: 04/09/2024]
Abstract
Rare structural variants (SVs), including insertions, deletions, and complex rearrangements, can cause Mendelian disease, yet they remain difficult to accurately detect and interpret. We sequenced and analyzed Oxford Nanopore long-read genomes of 68 individuals from the Undiagnosed Diseases Network (UDN) with no previously identified diagnostic mutations from short-read sequencing. Using our optimized SV detection pipelines and 571 control long-read genomes, we detected 716 long-read rare (MAF < 0.01) SV alleles per genome on average, a 2.4x increase over short reads. To characterize the functional effects of rare SVs, we assessed their relationship with gene expression from blood or fibroblasts from the same individuals, and found that rare SVs overlapping enhancers were enriched (LOR = 0.46) near expression outliers. We also evaluated tandem repeat expansions (TREs) and found 14 rare TREs per genome; notably, these TREs were also enriched near overexpression outliers. To prioritize candidate functional SVs, we developed Watershed-SV, a probabilistic model that integrates expression data with SV-specific genomic annotations and significantly outperforms baseline models that do not incorporate expression data. Watershed-SV identified a median of eight high-confidence functional SVs per UDN genome. Notably, this included compound heterozygous deletions in FAM177A1 shared by two siblings, which were likely causal for a rare neurodevelopmental disorder. Our observations demonstrate the promise of integrating long-read sequencing with gene expression towards improving the prioritization of functional SVs and TREs in rare disease patients.
18
Hiatt SM, Lawlor JM, Handley LH, Latner DR, Bonnstetter ZT, Finnila CR, Thompson ML, Boston LB, Williams M, Nunez IR, Jenkins J, Kelley WV, Bebin EM, Lopez MA, Hurst ACE, Korf BR, Schmutz J, Grimwood J, Cooper GM. Long-read genome sequencing and variant reanalysis increase diagnostic yield in neurodevelopmental disorders. medRxiv 2024:2024.03.22.24304633. [PMID: 38585854 PMCID: PMC10996728 DOI: 10.1101/2024.03.22.24304633] [Indexed: 04/09/2024]
Abstract
Variant detection from long-read genome sequencing (lrGS) has proven to be considerably more accurate and comprehensive than variant detection from short-read genome sequencing (srGS). However, the rate at which lrGS can increase molecular diagnostic yield for rare disease is not yet precisely characterized. We performed lrGS using Pacific Biosciences "HiFi" technology on 96 short-read-negative probands with rare disease that was suspected to be genetic. We generated hg38-aligned variants and de novo phased genome assemblies, and subsequently annotated, filtered, and curated variants using clinical standards. New disease-relevant or potentially relevant genetic findings were identified in 16/96 (16.7%) probands, eight of whom (8/96, 8.3%) harbored pathogenic or likely pathogenic variants. Newly identified variants were visible in both srGS and lrGS in nine probands (~9.4%) and resulted from changes to interpretation, mostly from recent gene-disease association discoveries. Seven cases included variants that were only interpretable in lrGS, including copy-number variants, an inversion, a mobile element insertion, two low-complexity repeat expansions, and a 1 bp deletion. While evidence for each of these variants is, in retrospect, visible in srGS, they were either not called within srGS data, represented by calls with incorrect sizes or structures, or failed quality control and filtration. Thus, while reanalysis of older data clearly increases diagnostic yield, we find that lrGS allows for substantial additional yield (7/96, 7.3%) beyond srGS. We anticipate that as lrGS analysis improves, and as lrGS datasets grow, allowing for better variant frequency annotation, the additional lrGS-only rare disease yield will grow over time.
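The yield figures quoted in this abstract can be checked with simple arithmetic. The sketch below recomputes each percentage from the counts stated above; the dictionary keys and the `yield_percent` helper are illustrative names, not taken from the paper.

```python
# Counts as stated in the abstract; the labels are illustrative, not the authors' terms.
COHORT_SIZE = 96

findings = {
    "any_new_finding": 16,              # disease-relevant or potentially relevant
    "pathogenic_or_likely": 8,          # subset with pathogenic/likely pathogenic variants
    "reinterpreted_srgs_and_lrgs": 9,   # visible in both srGS and lrGS
    "lrgs_only": 7,                     # only interpretable in lrGS
}


def yield_percent(count: int, cohort_size: int = COHORT_SIZE) -> float:
    """Return a diagnostic yield as a percentage of the cohort."""
    return 100.0 * count / cohort_size


for label, count in findings.items():
    print(f"{label}: {count}/{COHORT_SIZE} = {yield_percent(count):.1f}%")
```

Note that the nine reinterpreted cases and the seven lrGS-only cases together account for the 16 probands with new findings.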
Affiliation(s)
- Susan M. Hiatt
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Lori H. Handley
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Donald R. Latner
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Lori Beth Boston
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Melissa Williams
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Jerry Jenkins
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- E. Martina Bebin
  - Department of Neurology, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
- Michael A. Lopez
  - Department of Neurology, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
  - Department of Pediatrics, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
- Anna C. E. Hurst
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
- Bruce R. Korf
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, 35924, USA
- Jeremy Schmutz
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Jane Grimwood
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
19
Keskus A, Bryant A, Ahmad T, Yoo B, Aganezov S, Goretsky A, Donmez A, Lansdon LA, Rodriguez I, Park J, Liu Y, Cui X, Gardner J, McNulty B, Sacco S, Shetty J, Zhao Y, Tran B, Narzisi G, Helland A, Cook DE, Chang PC, Kolesnikov A, Carroll A, Molloy EK, Pushel I, Guest E, Pastinen T, Shafin K, Miga KH, Malikic S, Day CP, Robine N, Sahinalp C, Dean M, Farooqi MS, Paten B, Kolmogorov M. Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads. medRxiv 2024:2024.03.22.24304756. [PMID: 38585974 PMCID: PMC10996739 DOI: 10.1101/2024.03.22.24304756] [Indexed: 04/09/2024]
Abstract
Most current studies rely on short-read sequencing to detect somatic structural variation (SV) in cancer genomes. Long-read sequencing offers the advantage of better mappability and long-range phasing, which results in substantial improvements in germline SV detection. However, current long-read SV detection methods do not generalize well to the analysis of somatic SVs in tumor genomes with complex rearrangements, heterogeneity, and aneuploidy. Here, we present Severus: a method for the accurate detection of different types of somatic SVs using a phased breakpoint graph approach. To benchmark various short- and long-read SV detection methods, we sequenced five tumor/normal cell line pairs with Illumina, Nanopore, and PacBio sequencing platforms; on this benchmark Severus showed the highest F1 scores (harmonic mean of the precision and recall) as compared to long-read and short-read methods. We then applied Severus to three clinical cases of pediatric cancer, demonstrating concordance with known genetic findings as well as revealing clinically relevant cryptic rearrangements missed by standard genomic panels.
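The F1 score mentioned in this abstract is the harmonic mean of precision and recall. A minimal sketch of that calculation follows, assuming SV calls have already been matched against a truth set; the counts are hypothetical and do not come from the Severus benchmark.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from call counts."""
    called = true_pos + false_pos      # all calls made by the method
    expected = true_pos + false_neg    # all variants in the truth set
    precision = true_pos / called if called else 0.0
    recall = true_pos / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 90 correct calls, 10 spurious calls, 30 missed variants.
# Precision is 0.90 and recall is 0.75.
print(f"F1 = {f1_score(90, 10, 30):.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a caller cannot reach a high F1 by maximizing precision or recall alone.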
Affiliation(s)
- Ayse Keskus
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Asher Bryant
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Tanveer Ahmad
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Byunggil Yoo
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Anton Goretsky
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Ataberk Donmez
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Lisa A. Lansdon
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Isabel Rodriguez
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Jimin Park
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Yuelin Liu
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Xiwen Cui
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Samuel Sacco
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jyoti Shetty
  - Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Yongmei Zhao
  - Sequencing Facility Bioinformatics Group, Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Bao Tran
  - Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
- Erin K. Molloy
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Irina Pushel
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Erin Guest
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Tomi Pastinen
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Kishwar Shafin
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Karen H. Miga
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Salem Malikic
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Chi-Ping Day
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Cenk Sahinalp
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Michael Dean
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Rockville, MD, USA
- Midhat S. Farooqi
  - Children’s Mercy Hospital, University of Missouri-Kansas City School of Medicine, Kansas City, MO, USA
- Mikhail Kolmogorov
  - Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
20
Wang S, Lin J, Jia P, Xu T, Li X, Liu Y, Xu D, Bush SJ, Meng D, Ye K. De novo and somatic structural variant discovery with SVision-pro. Nat Biotechnol 2024:10.1038/s41587-024-02190-7. [PMID: 38519720 DOI: 10.1038/s41587-024-02190-7] [Received: 08/01/2023] [Accepted: 02/27/2024] [Indexed: 03/25/2024]
Abstract
Long-read-based de novo and somatic structural variant (SV) discovery remains challenging, necessitating genomic comparison between samples. We developed SVision-pro, a neural-network-based instance segmentation framework that represents genome-to-genome-level sequencing differences visually and discovers SVs comparatively between genomes without any prerequisite for inference models. SVision-pro outperforms state-of-the-art approaches; in particular, it improves the resolution of complex SVs, with low Mendelian error rates, high sensitivity for low-frequency SVs and reduced false-positive rates compared with SV merging approaches.
Affiliation(s)
- Songbo Wang
  - Department of Gynecology and Obstetrics, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Jiadong Lin
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Peng Jia
  - Department of Gynecology and Obstetrics, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Tun Xu
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Xiujuan Li
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Yuezhuangnan Liu
  - School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
- Dan Xu
  - School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
- Stephen J Bush
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Deyu Meng
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
  - Macau Institute of Systems Engineering, Macau University of Science and Technology, Taipa, Macau
  - Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China
- Kai Ye
  - Department of Gynecology and Obstetrics, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
  - Faculty of Science, Leiden University, Leiden, The Netherlands
  - Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
21
Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles. Nucleic Acids Res 2024; 52:2212-2230. [PMID: 38364871 PMCID: PMC10954445 DOI: 10.1093/nar/gkae086] [Received: 05/30/2023] [Revised: 01/18/2024] [Accepted: 01/27/2024] [Indexed: 02/18/2024] Open
Abstract
Nonreference sequences (NRSs) are DNA sequences present in global populations but absent in the current human reference genome. However, the extent and functional significance of NRSs in human genomes and populations remain unclear. Here, we de novo assembled 539 genomes from five genetically divergent human populations using long-read sequencing technology, resulting in the identification of 5.1 million NRSs. These were merged into 45,284 unique NRSs, with 29.7% being novel discoveries. Among these NRSs, 38.7% were common across the five populations, and 35.6% were population specific. The use of a graph-based pangenome approach allowed for the detection of 565 transcript expression quantitative trait loci on NRSs, with 426 of these being novel findings. Moreover, 26 NRS candidates displayed evidence of adaptive selection within human populations. Genes situated in close proximity to or intersecting with these candidates may be associated with metabolism and type 2 diabetes. Genome-wide association studies revealed 14 NRSs to be significantly associated with eight phenotypes. Additionally, 154 NRSs were found to be in strong linkage disequilibrium with 258 phenotype-associated SNPs in the GWAS catalogue. Our work expands the understanding of human NRSs and provides novel insights into their functions, facilitating evolutionary and biomedical research.
Affiliation(s)
- Zhikun Wu
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Tong Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Zehang Jiang
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Jingjing Zheng
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Yizhou Gu
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
  - University of Wisconsin-Madison, WI, USA
- Yizhi Liu
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Yun Liu
  - MOE Key Laboratory of Metabolism and Molecular Medicine, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences and Shanghai Xuhui Central Hospital, Fudan University, Shanghai, China
- Zhi Xie
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
22
Leonard AS, Mapel XM, Pausch H. Pangenome-genotyped structural variation improves molecular phenotype mapping in cattle. Genome Res 2024; 34:300-309. [PMID: 38355307 PMCID: PMC10984387 DOI: 10.1101/gr.278267.123] [Received: 07/11/2023] [Accepted: 02/01/2024] [Indexed: 02/16/2024]
Abstract
Expression and splicing quantitative trait loci (e/sQTL) are large contributors to phenotypic variability. Achieving sufficient statistical power for e/sQTL mapping requires large cohorts with both genotypes and molecular phenotypes, and so the genomic variation is often called from short-read alignments, which are unable to comprehensively resolve structural variation. Here we build a pangenome from 16 HiFi haplotype-resolved cattle assemblies to identify small and structural variation and genotype them with PanGenie in 307 short-read samples. We find high (>90%) concordance of PanGenie-genotyped and DeepVariant-called small variation and confidently genotype close to 21 million small and 43,000 structural variants in the larger population. We validate 85% of these structural variants (with MAF > 0.1) directly with a subset of 25 short-read samples that also have medium coverage HiFi reads. We then conduct e/sQTL mapping with this comprehensive variant set in a subset of 117 cattle that have testis transcriptome data, and find 92 structural variants as causal candidates for eQTL and 73 for sQTL. We find that roughly half of the top associated structural variants affecting expression or splicing are transposable elements, such as SV-eQTL for STN1 and MYH7 and SV-sQTL for CEP89 and ASAH2. Extensive linkage disequilibrium between small and structural variation results in only 28 additional eQTL and 17 sQTL discovered when including SVs, although many top associated SVs are compelling candidates.
Affiliation(s)
- Xena M Mapel
  - Animal Genomics, ETH Zurich, 8092 Zurich, Switzerland
- Hubert Pausch
  - Animal Genomics, ETH Zurich, 8092 Zurich, Switzerland
23
Gustafson JA, Gibson SB, Damaraju N, Zalusky MPG, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson Z, Storz SHR, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E, McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation. medRxiv 2024:2024.03.05.24303792. [PMID: 38496498 PMCID: PMC10942501 DOI: 10.1101/2024.03.05.24303792] [Indexed: 03/19/2024]
Abstract
Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
Affiliation(s)
- Jonas A. Gustafson
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
  - Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Sophia B. Gibson
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Nikhita Damaraju
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
  - Institute for Public Health Genetics, University of Washington, Seattle, WA, USA
- Miranda PG Zalusky
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- David Twesigomwe
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Lei Yang
  - Pacific Northwest Research Institute, Seattle, WA, USA
- Wouter De Coster
  - Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
  - Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Nathan D. Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Human Technopole, Milan, Italy
- Qiuhui Li
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Angela L. Miller
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Joy Goffena
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Zachery Anderson
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Sophie HR Storz
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Sydney A. Ward
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Maisha Sinha
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Claudia Gonzaga-Jauregui
  - International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México
- Wayne E. Clarke
  - New York Genome Center, New York, NY, USA
  - Outlier Informatics Inc., Saskatoon, SK, Canada
- Mahler Revsine
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Cate R. Paschal
  - Department of Laboratories, Seattle Children’s Hospital, Seattle, WA, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
- Christina Zakarian
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Sara Goodwin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Esther Robb
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Fritz J. Sedlazeck
  - Human Genome Sequencing Center Baylor College of Medicine, Houston, TX, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Justin M. Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Mikhail Kolmogorov
  - Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD, USA
- Michael C. Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Richard N. McLaughlin
  - Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
  - Pacific Northwest Research Institute, Seattle, WA, USA
- Harriet Dashnow
  - Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Matt Loose
  - Deep Seq, School of Life Sciences, University of Nottingham, Nottingham, England
- Miten Jain
  - Department of Bioengineering, Department of Physics, Khoury College of Computer Sciences, Northeastern University, Boston, MA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Danny E. Miller
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
24
Linderman MD, Wallace J, van der Heyde A, Wieman E, Brey D, Shi Y, Hansen P, Shamsi Z, Liu J, Gelb BD, Bashir A. NPSV-deep: a deep learning method for genotyping structural variants in short read genome sequencing data. Bioinformatics 2024; 40:btae129. [PMID: 38444093 PMCID: PMC10955255 DOI: 10.1093/bioinformatics/btae129] [Received: 08/24/2023] [Revised: 01/15/2024] [Accepted: 03/04/2024] [Indexed: 03/07/2024] Open
Abstract
MOTIVATION: Structural variants (SVs) play a causal role in numerous diseases but can be difficult to detect and accurately genotype (determine zygosity) with short-read genome sequencing data (SRS). Improving SV genotyping accuracy in SRS data, particularly for the many SVs first detected with long-read sequencing, will improve our understanding of genetic variation.
RESULTS: NPSV-deep is a deep learning-based approach for genotyping previously reported insertion and deletion SVs that recasts this task as an image similarity problem. NPSV-deep predicts the SV genotype based on the similarity between pileup images generated from the actual SRS data and matching SRS simulations. We show that NPSV-deep consistently matches or improves upon the state-of-the-art for SV genotyping accuracy across different SV call sets, samples and variant types, including a 25% reduction in genotyping errors for the Genome-in-a-Bottle (GIAB) high-confidence SVs. NPSV-deep is not limited to the SVs as described; it improves deletion genotyping concordance a further 1.5 percentage points for GIAB SVs (92%) by automatically correcting imprecise/incorrectly described SVs.
AVAILABILITY AND IMPLEMENTATION: Python/C++ source code and pre-trained models freely available at https://github.com/mlinderm/npsv2.
Collapse
Affiliation(s)
- Michael D Linderman
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Jacob Wallace
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Alderik van der Heyde
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Eliza Wieman
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Daniel Brey
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Yiran Shi
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Peter Hansen
- Department of Computer Science, Middlebury College, Middlebury, VT 05753, United States
- Bruce D Gelb
- Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ali Bashir
- Google, Mountain View, CA 94043, United States
25
Audano PA, Beck CR. Small polymorphisms are a source of ancestral bias in structural variant breakpoint placement. Genome Res 2024; 34:7-19. [PMID: 38176712 PMCID: PMC10904011 DOI: 10.1101/gr.278203.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 01/02/2024] [Indexed: 01/06/2024]
Abstract
High-quality genome assemblies and sophisticated algorithms have increased sensitivity for a wide range of variant types, and breakpoint accuracy for structural variants (SVs, ≥50 bp) has improved to near base pair precision. Despite these advances, many SV breakpoint locations are subject to systematic bias affecting variant representation. To understand why SV breakpoints are inconsistent across samples, we reanalyzed 64 phased haplotypes constructed from long-read assemblies released by the Human Genome Structural Variation Consortium (HGSVC). We identify 882 SV insertions and 180 SV deletions with variable breakpoints not anchored in tandem repeats (TRs) or segmental duplications (SDs). SVs called from aligned sequencing reads increase breakpoint disagreements by 2×-16×. Sequence accuracy had a minimal impact on breakpoints, but we observe a strong effect of ancestry. We confirm that SNP and indel polymorphisms are enriched at shifted breakpoints and are also absent from variant callsets. Breakpoint homology increases the likelihood of imprecise SV calls and the distance they are shifted, and tandem duplications are the most heavily affected SVs. Because graph genome methods normalize SV calls across samples, we investigated graphs generated by two different methods and find the resulting breakpoints are subject to other technical biases affecting breakpoint accuracy. The breakpoint inconsistencies we characterize affect ∼5% of the SVs called in a human genome and can impact variant interpretation and annotation. These limitations underscore a need for algorithm development to improve SV databases, mitigate the impact of ancestry on breakpoints, and increase the value of callsets for investigating breakpoint features.
Affiliation(s)
- Peter A Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, USA
- Christine R Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, USA
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030, USA
26
Zheng Z, Zhu M, Zhang J, Liu X, Hou L, Liu W, Yuan S, Luo C, Yao X, Liu J, Yang Y. A sequence-aware merger of genomic structural variations at population scale. Nat Commun 2024; 15:960. [PMID: 38307885 PMCID: PMC10837428 DOI: 10.1038/s41467-024-45244-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 01/18/2024] [Indexed: 02/04/2024] Open
Abstract
Merging structural variations (SVs) at the population level presents a significant challenge, yet it is essential for conducting comprehensive genotypic analyses, especially in the era of pangenomics. Here, we introduce PanPop, a tool that utilizes an advanced sequence-aware SV merging algorithm to efficiently merge SVs of various types. We demonstrate that PanPop can merge and optimize the majority of multiallelic SVs into informative biallelic variants. We show its superior precision and lower rates of missing data compared to alternative software solutions. Our approach not only enables the filtering of SVs by leveraging multiple SV callers for enhanced accuracy but also facilitates the accurate merging of large-scale population SVs. These capabilities of PanPop will help to accelerate future SV-related studies.
Affiliation(s)
- Zeyu Zheng
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Mingjia Zhu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jin Zhang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinfeng Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Liqiang Hou
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Wenyu Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Shuai Yuan
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Changhong Luo
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinhao Yao
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jianquan Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Yongzhi Yang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
27
Groza C, Schwendinger-Schreck C, Cheung WA, Farrow EG, Thiffault I, Lake J, Rizzo WB, Evrony G, Curran T, Bourque G, Pastinen T. Pangenome graphs improve the analysis of structural variants in rare genetic diseases. Nat Commun 2024; 15:657. [PMID: 38253606 PMCID: PMC10803329 DOI: 10.1038/s41467-024-44980-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 01/10/2024] [Indexed: 01/24/2024] Open
Abstract
Rare DNA alterations that cause heritable diseases are only partially resolvable by clinical next-generation sequencing due to the difficulty of detecting structural variation (SV) in all genomic contexts. Long-read, high fidelity genome sequencing (HiFi-GS) detects SVs with increased sensitivity and enables assembling personal and graph genomes. We leverage standard reference genomes, public assemblies (n = 94) and a large collection of HiFi-GS data from a rare disease program (Genomic Answers for Kids, GA4K, n = 574 assemblies) to build a graph genome representing a unified SV callset in GA4K, identify common variation and prioritize SVs that are more likely to cause genetic disease (MAF < 0.01). Using graphs, we obtain a higher level of reproducibility than the standard reference approach. We observe over 200,000 SV alleles unique to GA4K, including nearly 1000 rare variants that impact coding sequence. With improved specificity for rare SVs, we isolate 30 candidate SVs in phenotypically prioritized genes, including known disease SVs. We isolate a novel diagnostic SV in KMT2E, demonstrating use of personal assemblies coupled with pangenome graphs for rare disease genomics. The community may interrogate our pangenome with additional assemblies to discover new SVs within the allele frequency spectrum relevant to genetic diseases.
Affiliation(s)
- Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, QC, Canada
- Warren A Cheung
- Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- Emily G Farrow
- Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- Isabelle Thiffault
- Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- William B Rizzo
- Child Health Research Institute, Department of Pediatrics, Nebraska Medical Center, Omaha, NE, USA
- Gilad Evrony
- Center for Human Genetics and Genomics, Department of Pediatrics, Neuroscience & Physiology, New York University Grossman School of Medicine, New York, NY, USA
- Tom Curran
- Children's Mercy Research Institute, Kansas City, MO, USA
- Guillaume Bourque
- Canadian Center for Computational Genomics, McGill University, Montréal, QC, Canada
- Department of Human Genetics, McGill University, Montréal, QC, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Tomi Pastinen
- Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
28
Gaitán N, Duitama J. A graph clustering algorithm for detection and genotyping of structural variants from long reads. Gigascience 2024; 13:giad112. [PMID: 38206589 PMCID: PMC10783151 DOI: 10.1093/gigascience/giad112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Revised: 08/02/2023] [Accepted: 12/08/2023] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed. FINDINGS: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model is used to genotype SVs precisely based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulation and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths. CONCLUSION: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
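As a rough illustration of the clustering step this abstract describes, the sketch below groups SV signatures with a minimal pure-Python DBSCAN. It is not the authors' implementation: the `(position, length)` signature layout, the use of raw coordinates as the Euclidean space, and the `eps` and `min_pts` values are all assumptions made for the example, and the Bayesian genotyping step is not covered.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i]."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 meaning noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical signatures: (genomic position, SV length) from read alignments.
signatures = [
    (1000, 300), (1004, 298), (998, 305),   # reads supporting one SV
    (5000, 60), (5003, 62), (5001, 58),     # reads supporting another SV
    (9000, 1200),                           # isolated signature
]
labels = dbscan(signatures, eps=15.0, min_pts=3)
print(labels)
```

With these illustrative parameters the first six signatures fall into two clusters and the isolated one is labelled noise. A real caller would need to scale position and length so that neither dominates the distance.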
Affiliation(s)
- Nicolás Gaitán
- Systems and Computing Engineering Department, Universidad de Los Andes, Bogotá 111711, Colombia
- Jorge Duitama
- Systems and Computing Engineering Department, Universidad de Los Andes, Bogotá 111711, Colombia
29
Chaisson MJP, Sulovari A, Valdmanis PN, Miller DE, Eichler EE. Advances in the discovery and analyses of human tandem repeats. Emerg Top Life Sci 2023; 7:361-381. [PMID: 37905568 PMCID: PMC10806765 DOI: 10.1042/etls20230074] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/12/2023] [Revised: 10/18/2023] [Accepted: 10/18/2023] [Indexed: 11/02/2023]
Abstract
Long-read sequencing platforms provide unparalleled access to the structure and composition of all classes of tandemly repeated DNA from STRs to satellite arrays. This review summarizes our current understanding of their organization within the human genome, their importance with respect to disease, as well as the advances and challenges in understanding their genetic diversity and functional effects. Novel computational methods are being developed to visualize and associate these complex patterns of human variation with disease, expression, and epigenetic differences. We predict accurate characterization of this repeat-rich form of human variation will become increasingly relevant to both basic and clinical human genetics.
Affiliation(s)
- Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, U.S.A
- The Genomic and Epigenomic Regulation Program, USC Norris Cancer Center, University of Southern California, Los Angeles, CA 90089, U.S.A
- Arvis Sulovari
- Computational Biology, Cajal Neuroscience Inc, Seattle, WA 98102, U.S.A
- Paul N Valdmanis
- Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA 98195, U.S.A
- Danny E Miller
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA 98195, U.S.A
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA 98195, U.S.A
- Department of Pediatrics, University of Washington, Seattle, WA 98195, U.S.A
- Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, U.S.A
30
Zhao Y, Su C, He B, Nie R, Wang Y, Ma J, Song J, Yang Q, Hao J. Dispersal from the Qinghai-Tibet plateau by a high-altitude butterfly is associated with rapid expansion and reorganization of its genome. Nat Commun 2023; 14:8190. [PMID: 38081828 PMCID: PMC10713551 DOI: 10.1038/s41467-023-44023-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 11/28/2023] [Indexed: 12/18/2023] Open
Abstract
Parnassius glacialis is a typical "Out of the QTP" alpine butterfly that originated on the Qinghai-Tibet Plateau (QTP) and dispersed into relatively low-altitude mountainous regions. Here we assemble a chromosome-level genome of P. glacialis and resequence 9 populations in order to explore the genome evolution and local adaptation of this species. Our results indicate that the rapid accumulation and slow unequal recombination of transposable elements (TEs) contributed to the formation of its large genome. Several ribosomal gene families showed extensive expansion and selective evolution through transposon-mediated processed pseudogenes. Additionally, massive structural variations (SVs) of TEs affected the genetic differentiation of low-altitude populations. These low-altitude populations might have experienced a genetic bottleneck in the past and harbor genes with selective signatures that may be responsible for the potential adaptation to low-altitude environments. These results provide a foundation for understanding the genome evolution and local adaptation underlying the "Out of the QTP" dispersal of P. glacialis.
Affiliation(s)
- Youjie Zhao
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, Yunnan, China
- Chengyong Su
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
- Bo He
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
- Ruie Nie
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
- Yunliang Wang
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
- Junye Ma
- State Key Laboratory of Palaeobiology and Stratigraphy, Center for Excellence in Life and Palaeoenvironment, Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, Nanjing, 210008, China
- Jingyu Song
- College of Animal Science, Shandong Agricultural University, Taian, 271000, China
- Qun Yang
- State Key Laboratory of Palaeobiology and Stratigraphy, Center for Excellence in Life and Palaeoenvironment, Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, Nanjing, 210008, China
- Nanjing College, University of Chinese Academy of Sciences, Nanjing, 211135, China
- Jiasheng Hao
- College of Life Sciences, Anhui Normal University, Wuhu, 241000, China
31
Jia P, Dong L, Yang X, Wang B, Bush SJ, Wang T, Lin J, Wang S, Zhao X, Xu T, Che Y, Dang N, Ren L, Zhang Y, Wang X, Liang F, Wang Y, Ruan J, Xia H, Zheng Y, Shi L, Lv Y, Wang J, Ye K. Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet. Genome Biol 2023; 24:277. [PMID: 38049885 PMCID: PMC10694985 DOI: 10.1186/s13059-023-03116-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/25/2023] [Accepted: 11/21/2023] [Indexed: 12/06/2023] Open
Abstract
BACKGROUND: Recent state-of-the-art sequencing technologies enable the investigation of challenging regions in the human genome and expand the scope of variant benchmarking datasets. Herein, we sequence a Chinese Quartet, comprising two monozygotic twin daughters and their biological parents, using four short and long sequencing platforms (Illumina, BGI, PacBio, and Oxford Nanopore Technologies). RESULTS: The long reads from the monozygotic twin daughters are phased into paternal and maternal haplotypes using the parent-child genetic map. For each haplotype, we also use long reads to generate haplotype-resolved whole-genome assemblies with completeness and continuity exceeding that of GRCh38. Using this Quartet, we comprehensively catalogue the human variant landscape, generating a dataset of 3,962,453 SNVs, 886,648 indels (< 50 bp), 9,726 large deletions (≥ 50 bp), 15,600 large insertions (≥ 50 bp), 40 inversions, 31 complex structural variants, and 68 de novo mutations which are shared between the monozygotic twin daughters. Variants underrepresented in previous benchmarks owing to their complexity (including those located at long repeat regions, complex structural variants, and de novo mutations) are systematically examined in this study. CONCLUSIONS: In summary, this study provides high-quality haplotype-resolved assemblies and a comprehensive set of benchmarking resources for two Chinese monozygotic twin samples which, relative to existing benchmarks, offers expanded genomic coverage and insight into complex variant categories.
Affiliation(s)
- Peng Jia
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Lianhua Dong
- National Institute of Metrology, Beijing, 100029, China
- Xiaofei Yang
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Bo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Stephen J Bush
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Tingjie Wang
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Jiadong Lin
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Songbo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Xixi Zhao
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Tun Xu
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Yizhuo Che
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Ningxin Dang
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Luyao Ren
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Yujing Zhang
- National Institute of Metrology, Beijing, 100029, China
- Xia Wang
- National Institute of Metrology, Beijing, 100029, China
- Fan Liang
- GrandOmics Biosciences, Beijing, 100089, China
- Yang Wang
- GrandOmics Biosciences, Beijing, 100089, China
- Jue Ruan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
- Han Xia
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Yuanting Zheng
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Leming Shi
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Yi Lv
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Jing Wang
- National Institute of Metrology, Beijing, 100029, China
- Kai Ye
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
- Faculty of Science, Leiden University, Leiden, 2311EZ, The Netherlands
32
Reis ALM, Rapadas M, Hammond JM, Gamaarachchi H, Stevanovski I, Ayuputeri Kumaheri M, Chintalaphani SR, Dissanayake DSB, Siggs OM, Hewitt AW, Llamas B, Brown A, Baynam G, Mann GJ, McMorran BJ, Easteal S, Hermes A, Jenkins MR, Patel HR, Deveson IW. The landscape of genomic structural variation in Indigenous Australians. Nature 2023; 624:602-610. [PMID: 38093003 PMCID: PMC10733147 DOI: 10.1038/s41586-023-06842-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 01/16/2023] [Accepted: 11/07/2023] [Indexed: 12/20/2023]
Abstract
Indigenous Australians harbour rich and unique genomic diversity. However, Aboriginal and Torres Strait Islander ancestries are historically under-represented in genomics research and almost completely missing from reference datasets [1-3]. Addressing this representation gap is critical, both to advance our understanding of global human genomic diversity and as a prerequisite for ensuring equitable outcomes in genomic medicine. Here we apply population-scale whole-genome long-read sequencing [4] to profile genomic structural variation across four remote Indigenous communities. We uncover an abundance of large insertion-deletion variants (20-49 bp; n = 136,797), structural variants (50 bp-50 kb; n = 159,912) and regions of variable copy number (>50 kb; n = 156). The majority of variants are composed of tandem repeat or interspersed mobile element sequences (up to 90%) and have not been previously annotated (up to 62%). A large fraction of structural variants appear to be exclusive to Indigenous Australians (12% lower-bound estimate) and most of these are found in only a single community, underscoring the need for broad and deep sampling to achieve a comprehensive catalogue of genomic structural variation across the Australian continent. Finally, we explore short tandem repeats throughout the genome to characterize allelic diversity at 50 known disease loci [5], uncover hundreds of novel repeat expansion sites within protein-coding genes, and identify unique patterns of diversity and constraint among short tandem repeat sequences. Our study sheds new light on the dimensions and dynamics of genomic structural variation within and beyond Australia.
Affiliation(s)
- Andre L M Reis
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Faculty of Medicine, University of New South Wales, Sydney, New South Wales, Australia
- Melissa Rapadas
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Jillian M Hammond
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Hasindu Gamaarachchi
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia
- Igor Stevanovski
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Meutia Ayuputeri Kumaheri
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Sanjog R Chintalaphani
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Faculty of Medicine, University of New South Wales, Sydney, New South Wales, Australia
- Duminda S B Dissanayake
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- Institute for Applied Ecology, University of Canberra, Canberra, Australian Capital Territory, Australia
- Owen M Siggs
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia
- Department of Ophthalmology, Flinders University, Bedford Park, South Australia, Australia
| | - Alex W Hewitt
- Menzies Institute for Medical Research, University of Tasmania, Hobart, Tasmania, Australia
| | - Bastien Llamas
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- Australian Centre for Ancient DNA, School of Biological Sciences and Environment Institute, University of Adelaide, Adelaide, South Australia, Australia
- ARC Centre of Excellence for Australian Biodiversity and Heritage, University of Adelaide, Adelaide, South Australia, Australia
- Indigenous Genomics, Telethon Kids Institute, Adelaide, South Australia, Australia
| | - Alex Brown
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- Indigenous Genomics, Telethon Kids Institute, Adelaide, South Australia, Australia
| | - Gareth Baynam
- Telethon Kids Institute and Division of Paediatrics, Faculty of Health and Medical Sciences, University of Western Australia, Perth, Western Australia, Australia
- Genetic Services of Western Australia, Western Australian Department of Health, Perth, Western Australia, Australia
- Western Australian Register of Developmental Anomalies, Western Australian Department of Health, Perth, Western Australia, Australia
| | - Graham J Mann
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Brendan J McMorran
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Simon Easteal
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Azure Hermes
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Misty R Jenkins
- Immunology Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
| | - Hardip R Patel
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia.
| | - Ira W Deveson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales, Australia.
- Faculty of Medicine, University of New South Wales, Sydney, New South Wales, Australia.
| |
|
33
|
He Y, Chu Y, Guo S, Hu J, Li R, Zheng Y, Ma X, Du Z, Zhao L, Yu W, Xue J, Bian W, Yang F, Chen X, Zhang P, Wu R, Ma Y, Shao C, Chen J, Wang J, Li J, Wu J, Hu X, Long Q, Jiang M, Ye H, Song S, Li G, Wei Y, Xu Y, Ma Y, Chen Y, Wang K, Bao J, Xi W, Wang F, Ni W, Zhang M, Yu Y, Li S, Kang Y, Gao Z. T2T-YAO: A Telomere-to-telomere Assembled Diploid Reference Genome for Han Chinese. Genomics, Proteomics & Bioinformatics 2023; 21:1085-1100. [PMID: 37595788 PMCID: PMC11082261 DOI: 10.1016/j.gpb.2023.08.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/18/2023] [Revised: 08/01/2023] [Accepted: 08/08/2023] [Indexed: 08/20/2023]
Abstract
Since its initial release in 2001, the human reference genome has undergone continuous improvement in quality, and the recently released telomere-to-telomere (T2T) version, T2T-CHM13, reaches its highest level of continuity and accuracy after 20 years of effort by working on a simplified, nearly homozygous genome of a hydatidiform mole cell line. Here, to provide an authentic complete diploid human genome reference for the Han Chinese, the largest population in the world, we assembled the genome of a male Han Chinese individual, T2T-YAO, which includes T2T assemblies of all the 22 + X + M and 22 + Y chromosomes in both haplotypes. The quality of T2T-YAO is much better than that of all currently available diploid assemblies, and its haploid version, T2T-YAO-hp, generated by selecting the better assembly for each autosome, reaches the top quality of fewer than one error per 29.5 Mb, even higher than that of T2T-CHM13. Derived from an individual living in the aboriginal region of the Han population, T2T-YAO shows clear ancestry and potential genetic continuity from the ancient ancestors. Each haplotype of T2T-YAO possesses ~330 Mb of exclusive sequence, ~3100 unique genes, and tens of thousands of nucleotide and structural variations as compared with CHM13, highlighting the necessity of a population-stratified reference genome. The construction of T2T-YAO, an accurate and authentic representative of the Chinese population, would enable precise delineation of genomic variations and advance our understanding of the heritability of diseases and phenotypes, especially within the context of the unique variations of the Chinese population.
Affiliation(s)
- Yukun He
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China
| | - Yanan Chu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Shuming Guo
- Linfen Clinical Medicine Research Center, Linfen 041000, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China
| | - Jiang Hu
- GrandOmics Biosciences Co., Ltd, Wuhan 430076, China
| | - Ran Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yali Zheng
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Xinqian Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Zhenglin Du
- Institute of PSI Genomics, Wenzhou 325024, China
| | - Lili Zhao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wenyi Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Jianbo Xue
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wenjie Bian
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Feifei Yang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Xi Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Pingan Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Rihan Wu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yifan Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Changjun Shao
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jing Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jian Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jiwei Li
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Jing Wu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Xiaoyi Hu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Qiuyue Long
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Mingzheng Jiang
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Hongli Ye
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Shixu Song
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Guangyao Li
- Linfen Clinical Medicine Research Center, Linfen 041000, China
| | - Yue Wei
- Linfen Clinical Medicine Research Center, Linfen 041000, China
| | - Yu Xu
- Beijing Jishuitan Hospital, Capital Medical University, Beijing 100035, China
| | - Yanliang Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yanwen Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Keqiang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Jing Bao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wen Xi
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Fang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wentao Ni
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Moqin Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yan Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Shengnan Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yu Kang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100490, China.
| | - Zhancheng Gao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China.
| |
|
34
|
Ren L, Duan X, Dong L, Zhang R, Yang J, Gao Y, Peng R, Hou W, Liu Y, Li J, Yu Y, Zhang N, Shang J, Liang F, Wang D, Chen H, Sun L, Hao L, Scherer A, Nordlund J, Xiao W, Xu J, Tong W, Hu X, Jia P, Ye K, Li J, Jin L, Hong H, Wang J, Fan S, Fang X, Zheng Y, Shi L. Quartet DNA reference materials and datasets for comprehensively evaluating germline variant calling performance. Genome Biol 2023; 24:270. [PMID: 38012772 PMCID: PMC10680274 DOI: 10.1186/s13059-023-03109-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/08/2022] [Accepted: 11/13/2023] [Indexed: 11/29/2023] Open
Abstract
BACKGROUND: Genomic DNA reference materials are widely recognized as essential for ensuring data quality in omics research. However, relying solely on reference datasets to evaluate the accuracy of variant calling results is incomplete, as they are limited to benchmark regions. Therefore, it is important to develop DNA reference materials that enable the assessment of variant detection performance across the entire genome.
RESULTS: We established a DNA reference material suite from four immortalized cell lines derived from a family of parents and monozygotic twins. Comprehensive reference datasets of 4.2 million small variants and 15,000 structural variants were integrated and certified for evaluating the reliability of germline variant calls inside the benchmark regions. Importantly, the genetic built-in-truth of the Quartet family design enables estimation of the precision of variant calls outside the benchmark regions. Using the Quartet reference materials along with study samples, batch effects are objectively monitored and alleviated by training a machine learning model with the Quartet reference datasets to remove potential artifact calls. Moreover, the matched RNA and protein reference materials and datasets from the Quartet project enable cross-omics validation of variant calls from multiomics data.
CONCLUSIONS: The Quartet DNA reference materials and reference datasets provide a unique resource for objectively assessing the quality of germline variant calls throughout the whole-genome regions and improving the reliability of large-scale genomic profiling.
Affiliation(s)
- Luyao Ren
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Xiaoke Duan
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | | | - Rui Zhang
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Jingcheng Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
| | - Yuechen Gao
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Rongxue Peng
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Wanwan Hou
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Yaqing Liu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Jingjing Li
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Naixin Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Jun Shang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Fan Liang
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Depeng Wang
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Hui Chen
- OrigiMed Co., Ltd, Shanghai, China
| | - Lele Sun
- Sequanta Technologies Co., Ltd, Shanghai, China
| | | | - Andreas Scherer
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
| | - Jessica Nordlund
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
- Department of Medical Sciences, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Wenming Xiao
- Office of Oncologic Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
| | - Joshua Xu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Xin Hu
- Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Peng Jia
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Kai Ye
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Jinming Li
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Li Jin
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Jing Wang
- National Institute of Metrology, Beijing, China.
| | - Shaohua Fan
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
| | - Xiang Fang
- National Institute of Metrology, Beijing, China.
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Shanghai Cancer Center, Fudan University, Shanghai, China
- International Human Phenome Institutes, Shanghai, China
| |
|
35
|
Kolmogorov M, Billingsley KJ, Mastoras M, Meredith M, Monlong J, Lorig-Roach R, Asri M, Alvarez Jerez P, Malik L, Dewan R, Reed X, Genner RM, Daida K, Behera S, Shafin K, Pesout T, Prabakaran J, Carnevali P, Yang J, Rhie A, Scholz SW, Traynor BJ, Miga KH, Jain M, Timp W, Phillippy AM, Chaisson M, Sedlazeck FJ, Blauwendraat C, Paten B. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat Methods 2023; 20:1483-1492. [PMID: 37710018 PMCID: PMC11222905 DOI: 10.1038/s41592-023-01993-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 01/28/2023] [Accepted: 08/04/2023] [Indexed: 09/16/2023]
Abstract
Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.
Affiliation(s)
- Mikhail Kolmogorov
- Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Kimberley J Billingsley
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA.
| | - Mira Mastoras
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | - Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | - Mobin Asri
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Pilar Alvarez Jerez
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Laksh Malik
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Ramita Dewan
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Xylena Reed
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Rylee M Genner
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Kensuke Daida
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Trevor Pesout
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Jeshuwin Prabakaran
- Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, USA
| | | | - Jianzhi Yang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sonja W Scholz
- Neurodegenerative Diseases Research Unit, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD, USA
| | - Bryan J Traynor
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Miten Jain
- Department of Bioengineering, Northeastern University, Boston, MA, USA
- Department of Physics, Northeastern University, Boston, MA, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mark Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Cornelis Blauwendraat
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA.
| | | |
|
36
|
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 150] [Impact Index Per Article: 75.0] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Oxford Nanopore Technologies Inc., Oxford, UK
| | - Monika Cechova
- Faculty of Informatics, Masaryk University, Brno, Czech Republic
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Savannah J Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Nicolas Altemose
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
| | - Paul W Hook
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ivan A Alexandrov
- Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
| | - Jamie Allen
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Chen-Shan Chin
- GeneDX Holdings Corp, Stamford, CT, USA
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ariel Gershman
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer L Gerton
- Stowers Institute for Medical Research, Kansas City, MO, USA
- University of Kansas Medical Center, Kansas City, MO, USA
| | - Patrick G S Grady
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Reza Halabian
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Nancy F Hansen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Robert Harris
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Gabrielle A Hartley
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Jakob Heinz
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore, MD, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
| | - Rupesh K Kesharwani
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Julian K Lucas
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Wojciech Makalowski
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Christopher Markovic
- Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ann M McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer McDaniel
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brandy M McNulty
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- UCL Queen Square Institute of Neurology, UCL, London, UK
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Hugh E Olsen
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Nathan D Olson
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | - Steven L Salzberg
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Likhitha Surapaneni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Angela M Taravella Oill
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison C Watwood
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Melissa A Wilson
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
|
37
|
Ren J, Gu B, Chaisson MJP. vamos: variable-number tandem repeats annotation using efficient motif sets. Genome Biol 2023; 24:175. [PMID: 37501141 PMCID: PMC10373352 DOI: 10.1186/s13059-023-03010-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/23/2022] [Accepted: 07/06/2023] [Indexed: 07/29/2023]
Abstract
Roughly 3% of the human genome is composed of variable-number tandem repeats (VNTRs): arrays of motifs of at least six bases. These loci are highly polymorphic, yet current approaches that define and merge variants based on alignment breakpoints do not capture their full diversity. Here we present vamos (VNTR Annotation using efficient Motif Sets), a method that instead annotates VNTRs using repeat composition under different levels of motif diversity. Applying vamos to 74 haplotype-resolved human assemblies, we estimate 7.4-16.7 alleles per locus, compared to breakpoint-based approaches that estimate 4.0-5.5 alleles per locus.
Affiliation(s)
- Jingwen Ren
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, US
| | - Bida Gu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, US
| | - Mark J. P. Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, US
|
38
|
Tsouris A, Brach G, Friedrich A, Hou J, Schacherer J. Diallel panel reveals a significant impact of low-frequency genetic variants on gene expression variation in yeast. bioRxiv 2023:2023.07.21.550015. [PMID: 37503053 PMCID: PMC10370210 DOI: 10.1101/2023.07.21.550015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Unraveling the genetic sources of gene expression variation is essential to better understand the origins of phenotypic diversity in natural populations. Genome-wide association studies have identified thousands of variants involved in gene expression variation; however, the variants detected explain only part of the heritability. In fact, low-frequency variants and structural variants (SVs) are poorly captured in association studies. To assess the impact of these variants on gene expression variation, we explored a half-diallel panel composed of 323 hybrids originating from pairwise crosses of 26 natural Saccharomyces cerevisiae isolates. Using short- and long-read sequencing strategies, we established an exhaustive catalog of single nucleotide polymorphisms (SNPs) and SVs for this panel. Combining this dataset with the transcriptomes of all hybrids, we comprehensively mapped SNPs and SVs associated with gene expression variation. While SVs impact gene expression variation, SNPs exhibit a higher effect size, with an overrepresentation of low-frequency variants compared to common ones. These results reinforce the importance of dissecting the heritability of complex traits with a comprehensive catalog of genetic variants at the population level.
Affiliation(s)
- Andreas Tsouris
- Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
| | - Gauthier Brach
- Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
| | - Anne Friedrich
- Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
| | - Jing Hou
- Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
| | - Joseph Schacherer
- Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Institut Universitaire de France (IUF), Paris, France
|
39
|
Leonard AS, Crysnanto D, Mapel XM, Bhati M, Pausch H. Graph construction method impacts variation representation and analyses in a bovine super-pangenome. Genome Biol 2023; 24:124. [PMID: 37217946 PMCID: PMC10204317 DOI: 10.1186/s13059-023-02969-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 09/17/2022] [Accepted: 05/10/2023] [Indexed: 05/24/2023]
Abstract
BACKGROUND: Several models and algorithms have been proposed to build pangenomes from multiple input assemblies, but their impact on variant representation, and consequently downstream analyses, is largely unknown. RESULTS: We create multi-species super-pangenomes using pggb, cactus, and minigraph with the Bos taurus taurus reference sequence and eleven haplotype-resolved assemblies from taurine and indicine cattle, bison, yak, and gaur. We recover 221 k nonredundant structural variations (SVs) from the pangenomes, of which 135 k (61%) are common to all three. SVs derived from assembly-based calling show high agreement with the consensus calls from the pangenomes (96%), but validate only a small proportion of variations private to each graph. Pggb and cactus, which also incorporate base-level variation, have approximately 95% exact matches with assembly-derived small variant calls, which significantly improves the edit rate when realigning assemblies compared to minigraph. We use the three pangenomes to investigate 9566 variable number tandem repeats (VNTRs), finding 63% have identical predicted repeat counts in the three graphs, while minigraph can over- or underestimate the count given its approximate coordinate system. We examine a highly variable VNTR locus and show that repeat unit copy number impacts the expression of proximal genes and non-coding RNA. CONCLUSIONS: Our findings indicate good consensus between the three pangenome methods but also show their individual strengths and weaknesses that need to be considered when analysing different types of variants from multiple input assemblies.
Affiliation(s)
- Alexander S Leonard
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Danang Crysnanto
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Xena M Mapel
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Meenu Bhati
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Hubert Pausch
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
|
40
|
Lin J, Jia P, Wang S, Kosters W, Ye K. Comparison and benchmark of structural variants detected from long read and long-read assembly. Brief Bioinform 2023:7169138. [PMID: 37200087 DOI: 10.1093/bib/bbad188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 04/25/2023] [Accepted: 04/26/2023] [Indexed: 05/20/2023]
Abstract
Structural variant (SV) detection is essential for genomic studies, and long-read sequencing technologies have advanced our capacity to detect SVs directly from reads or de novo assemblies, known as the read-based and assembly-based strategies, respectively. However, to date, no independent studies have compared and benchmarked the two strategies. Here, on the basis of SVs detected by 20 read-based and eight assembly-based detection pipelines from six datasets of the HG002 genome, we investigated the factors that influence the two strategies and assessed their performance with well-curated SVs. We found that up to 80% of the SVs could be detected by both strategies among different long-read datasets, whereas the variant type, size, and breakpoint detected by the read-based strategy were greatly affected by aligners. For the high-confidence insertions and deletions at non-tandem repeat regions, a remarkable subset of them (82% in assembly-based calls and 93% in read-based calls), accounting for around 4000 SVs, could be captured by both reads and assemblies. However, discordance between the two strategies was largely caused by complex SVs and inversions, which resulted from inconsistent alignment of reads and assemblies at these loci. Finally, benchmarking with SVs at medically relevant genes, the recall of the read-based strategy reached 77% on 5X coverage data, whereas the assembly-based strategy required 20X coverage data to achieve similar performance. Therefore, integrating SVs from reads and assemblies is suggested for general-purpose detection because of inconsistently detected complex SVs and inversions, whereas the assembly-based strategy is optional for applications with limited resources.
Affiliation(s)
- Jiadong Lin
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
- Leiden Institute of Advanced Computer Science, Faculty of Science, Leiden University, Leiden 2311 EZ, The Netherlands
| | - Peng Jia
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
| | - Songbo Wang
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
| | - Walter Kosters
- Leiden Institute of Advanced Computer Science, Faculty of Science, Leiden University, Leiden 2311 EZ, The Netherlands
| | - Kai Ye
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
- The School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
- Faculty of Science, Leiden University, Leiden 2311, The Netherlands
|
41
|
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 363] [Impact Index Per Article: 181.5] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
| | - Mobin Asri
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Daniel Doerr
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Marina Haukness
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Shuangjia Lu
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - Julian K Lucas
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jean Monlong
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haley J Abel
- Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
| | - Silvia Buonaiuto
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | - Xian H Chang
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Justin Chu
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Christian Fischer
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Robert S Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Shilpa Garg
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
| | - Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
| | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Charles Markello
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Hugh E Olsen
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Trevor Pesout
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jonas A Sibbesen
- Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Chad Tomlinson
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
| | - Carl A Baker
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Robert M Cook-Deegan
- Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
| | - Omar E Cornejo
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Mark Diekhans
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam L Felsenfeld
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Nanibaa' A Garrison
- Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
- Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Hugo Magalhães
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Alice B Popejoy
- Department of Public Health Sciences, University of California, Davis, CA, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I Schultz
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Michael W Smith
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J Sofia
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
- Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - David Haussler
- Genomics Institute, University of California, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Karen H Miga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
| | - Ira M Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA.
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, CA, USA.
42
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 15.5] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Nathan Dwarshuis
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
| | | | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.