151
van Dijk EL, Naquin D, Gorrichon K, Jaszczyszyn Y, Ouazahrou R, Thermes C, Hernandez C. Genomics in the long-read sequencing era. Trends Genet 2023; 39:649-671. [PMID: 37230864 DOI: 10.1016/j.tig.2023.04.006] [Citation(s) in RCA: 57] [Impact Index Per Article: 28.5] [Received: 02/10/2023] [Revised: 04/21/2023] [Accepted: 04/25/2023] [Indexed: 05/27/2023]
Abstract
Long-read sequencing (LRS) technologies have provided extremely powerful tools to explore genomes. While in the early years these methods suffered technical limitations, they have recently made significant progress in terms of read length, throughput, and accuracy, and bioinformatics tools have strongly improved. Here, we aim to review the current status of LRS technologies, the development of novel methods, and the impact on genomics research. We will explore the most impactful recent findings made possible by these technologies, focusing on high-resolution sequencing of genomes and transcriptomes and the direct detection of DNA and RNA modifications. We will also discuss how LRS methods promise a more comprehensive understanding of human genetic variation, transcriptomics, and epigenetics for the coming years.
Affiliation(s)
- Erwin L van Dijk
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Delphine Naquin
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Kévin Gorrichon
  - National Center of Human Genomics Research (CNRGH), 91000 Évry-Courcouronnes, France
- Yan Jaszczyszyn
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Rania Ouazahrou
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Claude Thermes
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Céline Hernandez
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
152
Bzikadze AV, Pevzner PA. UniAligner: a parameter-free framework for fast sequence alignment. Nat Methods 2023; 20:1346-1354. [PMID: 37580559 DOI: 10.1038/s41592-023-01970-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Accepted: 07/05/2023] [Indexed: 08/16/2023]
Abstract
Even though the recent advances in 'complete genomics' revealed the previously inaccessible genomic regions, analysis of variations in centromeres and other extra-long tandem repeats (ETRs) faces an algorithmic challenge since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, the classical alignment approaches, such as the Smith-Waterman algorithm, fail to construct biologically adequate alignments of ETRs. We present UniAligner, the parameter-free sequence alignment algorithm with sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. UniAligner prioritizes matches of rare substrings that are more likely to be relevant to the evolutionary relationship between two sequences. We apply UniAligner to estimate the mutation rates in human centromeres, and quantify the extremely high rate of large duplications and deletions in centromeres. This high rate suggests that centromeres may represent some of the most rapidly evolving regions of the human genome with respect to their structural organization.
Affiliation(s)
- Andrey V Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA
- Pavel A Pevzner
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
153
Aradhya S, Facio FM, Metz H, Manders T, Colavin A, Kobayashi Y, Nykamp K, Johnson B, Nussbaum RL. Applications of artificial intelligence in clinical laboratory genomics. Am J Med Genet C Semin Med Genet 2023; 193:e32057. [PMID: 37507620 DOI: 10.1002/ajmg.c.32057] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/15/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 07/30/2023]
Abstract
The transition from analog to digital technologies in clinical laboratory genomics is ushering in an era of "big data" in ways that will exceed human capacity to rapidly and reproducibly analyze those data using conventional approaches. Accurately evaluating complex molecular data to facilitate timely diagnosis and management of genomic disorders will require supportive artificial intelligence methods. These are already being introduced into clinical laboratory genomics to identify variants in DNA sequencing data, predict the effects of DNA variants on protein structure and function to inform clinical interpretation of pathogenicity, link phenotype ontologies to genetic variants identified through exome or genome sequencing to help clinicians reach diagnostic answers faster, correlate genomic data with tumor staging and treatment approaches, utilize natural language processing to identify critical published medical literature during analysis of genomic data, and use interactive chatbots to identify individuals who qualify for genetic testing or to provide pre-test and post-test education. With careful and ethical development and validation of artificial intelligence for clinical laboratory genomics, these advances are expected to significantly enhance the abilities of geneticists to translate complex data into clearly synthesized information for clinicians to use in managing the care of their patients at scale.
Affiliation(s)
- Swaroop Aradhya
  - Invitae Corporation, San Francisco, California, USA
  - Adjunct Clinical Faculty, Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Hillery Metz
  - Invitae Corporation, San Francisco, California, USA
- Toby Manders
  - Invitae Corporation, San Francisco, California, USA
- Keith Nykamp
  - Invitae Corporation, San Francisco, California, USA
- Robert L Nussbaum
  - Invitae Corporation, San Francisco, California, USA
  - Volunteer Faculty, School of Medicine, University of California San Francisco, San Francisco, California, USA
154
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 183] [Impact Index Per Article: 91.5] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Oxford Nanopore Technologies Inc., Oxford, UK
- Monika Cechova
  - Faculty of Informatics, Masaryk University, Brno, Czech Republic
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- Savannah J Hoyt
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Dylan J Taylor
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Nicolas Altemose
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
- Paul W Hook
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Mikko Rautiainen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ivan A Alexandrov
  - Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
- Jamie Allen
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Andrey V Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Chen-Shan Chin
  - GeneDX Holdings Corp, Stamford, CT, USA
  - Foundation of Biological Data Science, Belmont, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Carlos Garcia Giron
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ariel Gershman
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jennifer L Gerton
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - University of Kansas Medical Center, Kansas City, MO, USA
- Patrick G S Grady
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- Leanne Haggerty
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Reza Halabian
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Nancy F Hansen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Robert Harris
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Gabrielle A Hartley
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Jakob Heinz
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah E Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, MD, USA
- Miten Jain
  - Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
- Rupesh K Kesharwani
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Alexandra P Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Glennis A Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Julian K Lucas
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Wojciech Makalowski
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Christopher Markovic
  - Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ann M Mc Cartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Rajiv C McCoy
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Jennifer McDaniel
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Brandy M McNulty
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
  - Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - UCL Queen Square Institute of Neurology, UCL, London, UK
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hugh E Olsen
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Nathan D Olson
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Luis F Paulin
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Tamara Potapova
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Fedor Ryabov
  - Masters Program in National Research University Higher School of Economics, Moscow, Russia
- Steven L Salzberg
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Alaina Shumate
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Likhitha Surapaneni
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Angela M Taravella Oill
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Winston Timp
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Marta Tomaszkiewicz
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
  - Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
- Mitchell R Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Brian P Walenz
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Allison C Watwood
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Melissa A Wilson
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Samantha Zarate
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Yiming Zhu
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Justin M Zook
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Rachel J O'Neill
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
  - Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
- Michael C Schatz
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Karen H Miga
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Kateryna D Makova
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
155
Lee H, Greer SU, Pavlichin DS, Zhou B, Urban AE, Weissman T, Ji HP. Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome. Cell Rep Methods 2023; 3:100543. [PMID: 37671027 PMCID: PMC10475782 DOI: 10.1016/j.crmeth.2023.100543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Revised: 04/14/2023] [Accepted: 07/06/2023] [Indexed: 09/07/2023]
Abstract
The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.
Affiliation(s)
- HoJoon Lee
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Stephanie U. Greer
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Dmitri S. Pavlichin
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Bo Zhou
  - Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Alexander E. Urban
  - Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Palo Alto, CA 94304, USA
- Hanlee P. Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Electrical Engineering, Stanford University, Palo Alto, CA 94304, USA
156
Abstract
The p-arms of the five human acrocentric chromosomes bear nucleolar organizer regions (NORs) comprising ribosomal gene (rDNA) repeats that are organized in a homogeneous tandem array and transcribed in a telomere-to-centromere direction. Precursor ribosomal RNA transcripts are processed and assembled into ribosomal subunits, the nucleolus being the physical manifestation of this process. I review current understanding of nucleolar chromosome biology and describe current exploration into a role for the NOR chromosomal context. Full DNA sequences for acrocentric p-arms are now emerging, aided by the current revolution in long-read sequencing and genome assembly. Acrocentric p-arms vary from 10.1 to 16.7 Mb, accounting for ∼2.2% of the genome. Bordering rDNA arrays, distal junctions, and proximal junctions are shared among the p-arms, with distal junctions showing evidence of functionality. The remaining p-arm sequences comprise multiple satellite DNA classes and segmental duplications that facilitate recombination between heterologous chromosomes, which is likely also involved in Robertsonian translocations.
Affiliation(s)
- Brian McStay
  - Centre for Chromosome Biology, College of Science and Engineering, University of Galway, Galway, Ireland
157
Abstract
DNA sequencing has revolutionized medicine over recent decades. However, analysis of large structural variation and repetitive DNA, a hallmark of human genomes, has been limited by short-read technology, with read lengths of 100-300 bp. Long-read sequencing (LRS) permits routine sequencing of human DNA fragments tens to hundreds of kilobase pairs in size, using both real-time sequencing by synthesis and nanopore-based direct electronic sequencing. LRS permits analysis of large structural variation and haplotypic phasing in human genomes and has enabled the discovery and characterization of rare pathogenic structural variants and repeat expansions. It has also recently enabled the assembly of a complete, gapless human genome that includes previously intractable regions, such as highly repetitive centromeres and homologous acrocentric short arms. With the addition of protocols for targeted enrichment, direct epigenetic DNA modification detection, and long-range chromatin profiling, LRS promises to launch a new era of understanding of genetic diversity and pathogenic mutations in human populations.
Affiliation(s)
- Peter E Warburton
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Center for Advanced Genomics Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Robert P Sebra
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Center for Advanced Genomics Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Black Family Stem Cell Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Icahn Genomics Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
158
Knoppers BM, Bernier A, Bowers S, Kirby E. Open Data in the Era of the GDPR: Lessons from the Human Cell Atlas. Annu Rev Genomics Hum Genet 2023; 24:369-391. [PMID: 36791787 DOI: 10.1146/annurev-genom-101322-113255] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 02/17/2023]
Abstract
The Human Cell Atlas (HCA) is striving to build an open community that is inclusive of all researchers adhering to its principles and as open as possible with respect to data access and use. However, open data sharing can pose certain challenges. For instance, being a global initiative, the HCA must contend with a patchwork of local and regional privacy rules. A notable example is the implementation of the European Union General Data Protection Regulation (GDPR), which caused some concern in the biomedical and genomic data-sharing community. We examine how the HCA's large, international group of researchers is investing tremendous efforts into ensuring appropriate sharing of data. We describe the HCA's objectives and governance, how it defines open data sharing, and ethico-legal challenges encountered early in its development; in particular, we describe the challenges prompted by the GDPR. Finally, we broaden the discussion to address tools and strategies that can be used to address ethical data governance.
Affiliation(s)
- Bartha Maria Knoppers
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Alexander Bernier
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Emily Kirby
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
159
Shiraishi Y, Koya J, Chiba K, Okada A, Arai Y, Saito Y, Shibata T, Kataoka K. Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv. Nucleic Acids Res 2023; 51:e74. [PMID: 37336583 PMCID: PMC10415145 DOI: 10.1093/nar/gkad526] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 02/14/2023] [Revised: 05/23/2023] [Accepted: 06/07/2023] [Indexed: 06/21/2023]
Abstract
We present our novel software, nanomonsv, for detecting somatic structural variations (SVs) at single-base resolution using tumor and matched control long-read sequencing data. The current version of nanomonsv includes two detection modules: the Canonical SV module and the Single breakend SV module. Using tumor/control paired long-read sequencing data from three cancer lines and their matched lymphoblastoid lines, we demonstrate that the Canonical SV module can identify somatic SVs that can be captured by short-read technologies with higher precision and recall than existing methods. In addition, we have developed a workflow to classify mobile element insertions while elucidating their in-depth properties, such as 5' truncations, internal inversions, and source sites for 3' transductions. Furthermore, the Single breakend SV module enables the detection of complex SVs that can only be identified by long reads, such as SVs involving highly repetitive centromeric sequences, and LINE1- and virus-mediated rearrangements. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to a better understanding of mutational processes and functional consequences of somatic SVs.
Collapse
Affiliation(s)
- Yuichi Shiraishi
- Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Junji Koya
- Division of Molecular Oncology, National Cancer Center Research Institute, Tokyo, Japan
| | - Kenichi Chiba
- Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Ai Okada
- Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Yasuhito Arai
- Division of Cancer Genomics, National Cancer Center Research Institute, Tokyo, Japan
| | - Yuki Saito
- Division of Molecular Oncology, National Cancer Center Research Institute, Tokyo, Japan
- Department of Gastroenterology, Keio University School of Medicine, Tokyo, Japan
| | - Tatsuhiro Shibata
- Division of Cancer Genomics, National Cancer Center Research Institute, Tokyo, Japan
- Laboratory of Molecular Medicine, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
| | - Keisuke Kataoka
- Division of Molecular Oncology, National Cancer Center Research Institute, Tokyo, Japan
- Department of Hematology, Keio University School of Medicine, Tokyo, Japan
| |
|
160
|
Ball RL, Bogue MA, Liang H, Srivastava A, Ashbrook DG, Lamoureux A, Gerring MW, Hatoum AS, Kim M, He H, Emerson J, Berger AK, Walton DO, Sheppard K, Kassaby BE, Castellanos F, Kunde-Ramamoorthy G, Lu L, Bluis J, Desai S, Sundberg BA, Peltz G, Fang Z, Churchill GA, Williams RW, Agrawal A, Bult CJ, Philip VM, Chesler EJ. GenomeMUSter mouse genetic variation service enables multi-trait, multi-population data integration and analyses. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.08.552506. [PMID: 37609331 PMCID: PMC10441370 DOI: 10.1101/2023.08.08.552506] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 08/24/2023]
Abstract
Hundreds of inbred laboratory mouse strains and intercross populations have been used to functionalize genetic variants that contribute to disease. Thousands of disease-relevant traits have been characterized in mice and made publicly available. New strains and populations including the Collaborative Cross, expanded BXD and inbred wild-derived strains add to the set of complex disease mouse models, genetic mapping resources and sensitized backgrounds against which to evaluate engineered mutations. The genome sequences of many inbred strains, along with dense genotypes from others, could allow integrated analysis of trait-variant associations across populations, but these analyses are not feasible due to the sparsity of genotypes available. Moreover, the data are not readily interoperable with other resources. To address these limitations, we created a uniformly dense data resource by harmonizing multiple variant datasets. Missing genotypes were imputed using the Viterbi algorithm with a data-driven technique that incorporates local phylogenetic information, an approach that is extensible to other model organism species. The result is a web- and programmatically accessible data service called GenomeMUSter ( https://muster.jax.org ), comprising allelic data covering 657 strains at 106.8M segregating sites. Interoperation with phenotype databases, analytic tools and other resources enables a wealth of applications including multi-trait, multi-population meta-analysis. We demonstrate this in a cross-species comparison of the meta-analysis of Type 2 Diabetes and of substance use disorders, resulting in the more specific characterization of the role of human variant effects in light of mouse phenotype data. Other applications include refinement of mapped loci and prioritization of strain backgrounds for disease modeling to further unlock extant mouse diversity for genetic and genomic studies in health and disease.
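To make the idea of Viterbi-based imputation concrete, the following is a minimal sketch under stated assumptions, not the GenomeMUSter implementation. It assumes a simple haplotype-copying hidden Markov model in which the hidden state at each site is the donor strain being copied. The function name `viterbi_impute` and the `switch` and `error` parameters are hypothetical, and the local phylogenetic weighting described in the abstract is not modeled here.

```python
import math


def viterbi_impute(target, donors, switch=0.05, error=0.01):
    """Fill None entries in `target` by copying alleles along the most likely donor path.

    target : list of alleles (e.g. 0/1), with None marking a missing genotype
    donors : list of fully genotyped strains, each a list of the same length
    switch : assumed probability of changing donor between adjacent sites
    error  : assumed probability that an observed allele mismatches its donor
    """
    n_sites, n_donors = len(target), len(donors)
    log_stay = math.log(1.0 - switch)
    log_switch = math.log(switch / (n_donors - 1)) if n_donors > 1 else float("-inf")

    def log_emit(d, s):
        # A missing observation carries no information about the donor.
        if target[s] is None:
            return 0.0
        return math.log(1.0 - error) if donors[d][s] == target[s] else math.log(error)

    # Initialisation: uniform prior over donors at the first site.
    score = [math.log(1.0 / n_donors) + log_emit(d, 0) for d in range(n_donors)]
    back = []  # back[i][d] = best donor at site i given donor d at site i + 1

    # Recursion: keep the best-scoring predecessor for every donor at every site.
    for s in range(1, n_sites):
        new_score, pointers = [], []
        for d in range(n_donors):
            best_prev = max(
                range(n_donors),
                key=lambda p: score[p] + (log_stay if p == d else log_switch),
            )
            trans = log_stay if best_prev == d else log_switch
            new_score.append(score[best_prev] + trans + log_emit(d, s))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Traceback from the best final donor to recover the full path.
    state = max(range(n_donors), key=lambda d: score[d])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Observed alleles are kept; missing ones are copied from the path.
    return [donors[path[s]][s] if target[s] is None else target[s] for s in range(n_sites)]


if __name__ == "__main__":
    reference_strains = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
    print(viterbi_impute([1, 1, None, 1, 1], reference_strains))
```

In this toy model a low `switch` value favors long shared haplotype blocks, which is why a missing site flanked by alleles matching one donor is filled from that donor.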
|
161
|
Cruz LA, Cooke Bailey JN, Crawford DC. Importance of Diversity in Precision Medicine: Generalizability of Genetic Associations Across Ancestry Groups Toward Better Identification of Disease Susceptibility Variants. Annu Rev Biomed Data Sci 2023; 6:339-356. [PMID: 37196357 PMCID: PMC10720270 DOI: 10.1146/annurev-biodatasci-122220-113250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2023]
Abstract
Genome-wide association studies (GWAS) revolutionized our understanding of common genetic variation and its impact on common human disease and traits. Developed and adopted in the mid-2000s, GWAS led to searchable genotype-phenotype catalogs and genome-wide datasets available for further data mining and analysis for the eventual development of translational applications. The GWAS revolution was swift and specific, including almost exclusively populations of European descent, to the neglect of the majority of the world's genetic diversity. In this narrative review, we recount the GWAS landscape of the early years that established a genotype-phenotype catalog that is now universally understood to be inadequate for a complete understanding of complex human genetics. We then describe approaches taken to augment the genotype-phenotype catalog, including the study populations, collaborative consortia, and study design approaches aimed to generalize and then ultimately discover genome-wide associations in non-European descent populations. The collaborations and data resources established in the efforts to diversify genomic findings undoubtedly provide the foundations of the next chapters of genetic association studies with the advent of budget-friendly whole-genome sequencing.
Affiliation(s)
- Lauren A Cruz
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, USA;
- Cleveland Institute for Computational Biology, Case Western Reserve University, Cleveland, Ohio, USA
| | - Jessica N Cooke Bailey
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, USA;
- Department of Genetics and Genome Sciences, Case Western Reserve University, Cleveland, Ohio, USA
- Cleveland Institute for Computational Biology, Case Western Reserve University, Cleveland, Ohio, USA
| | - Dana C Crawford
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, USA;
- Department of Genetics and Genome Sciences, Case Western Reserve University, Cleveland, Ohio, USA
- Cleveland Institute for Computational Biology, Case Western Reserve University, Cleveland, Ohio, USA
| |
|
162
|
Choi J, Kim S, Kim J, Son HY, Yoo SK, Kim CU, Park YJ, Moon S, Cha B, Jeon MC, Park K, Yun JM, Cho B, Kim N, Kim C, Kwon NJ, Park YJ, Matsuda F, Momozawa Y, Kubo M, Biobank Japan Project, Kim HJ, Park JH, Seo JS, Kim JI, Im SW. A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants. SCIENCE ADVANCES 2023; 9:eadg6319. [PMID: 37556544 PMCID: PMC10411914 DOI: 10.1126/sciadv.adg6319] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 01/11/2023] [Accepted: 07/06/2023] [Indexed: 08/11/2023]
Abstract
Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.
Affiliation(s)
- Jaeyong Choi
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea
| | | | - Juhyun Kim
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Ho-Young Son
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Republic of Korea
| | - Seong-Keun Yoo
- The Marc and Jennifer Lipschultz Precision Immunology Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | | - Young Jun Park
- Department of Translational Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungji Moon
- Interdisciplinary Program in Cancer Biology, Seoul National University College of Medicine, Seoul, Republic of Korea
- Cancer Research Institute, Seoul National University, Seoul, Republic of Korea
| | - Bukyoung Cha
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Republic of Korea
| | - Min Chul Jeon
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Kyunghyuk Park
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Republic of Korea
| | - Jae Moon Yun
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea
| | - Belong Cho
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | | | | | | | - Young Joo Park
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Republic of Korea
- Department of Internal Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Republic of Korea
| | - Fumihiko Matsuda
- Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
| | | | - Michiaki Kubo
- RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | | | - Hyun-Jin Kim
- National Cancer Control Institute, National Cancer Center, Goyang, Republic of Korea
| | - Jin-Ho Park
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Jeong-Sun Seo
- Macrogen Inc., Seoul, Republic of Korea
- Asian Genome Center, Seoul National University Bundang Hospital, Gyeonggi, Republic of Korea
| | - Jong-Il Kim
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea
- Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Republic of Korea
- Cancer Research Institute, Seoul National University, Seoul, Republic of Korea
- Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sun-Wha Im
- Department of Biochemistry and Molecular Biology, Kangwon National University School of Medicine, Gangwon, Republic of Korea
| |
|
163
|
Abstract
Within the next decade, the genomes of 1.8 million eukaryotic species will be sequenced. Identifying genes in these sequences is essential to understand the biology of the species. This is challenging due to the transcriptional complexity of eukaryotic genomes, which encode hundreds of thousands of transcripts of multiple types. Among these, a small set of protein-coding mRNAs play a disproportionately large role in defining phenotypes. Due to their sequence conservation, orthology can be established, making it possible to define the universal catalog of eukaryotic protein-coding genes. This catalog should substantially contribute to uncovering the genomic events underlying the emergence of eukaryotic phenotypes. This piece briefly reviews the basics of protein-coding gene prediction, discusses challenges in finalizing annotation of the human genome, and proposes strategies for producing annotations across the eukaryotic Tree of Life. This lays the groundwork for obtaining the catalog of all genes: the Earth's code of life.
Affiliation(s)
- Roderic Guigó
- Bioinformatics and Genomics, Center for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Catalonia
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia
| |
|
164
|
Yu H, Zheng Z, Su J, Lam TW, Luo R. Boosting variant-calling performance with multi-platform sequencing data using Clair3-MP. BMC Bioinformatics 2023; 24:308. [PMID: 37537536 PMCID: PMC10401749 DOI: 10.1186/s12859-023-05434-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 07/31/2023] [Indexed: 08/05/2023]
Abstract
BACKGROUND: With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially integrating Oxford Nanopore and Illumina sequencing data. RESULTS: We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). Through our research, we not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as the large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP. CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.
Affiliation(s)
- Huijing Yu
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Zhenxian Zheng
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Junhao Su
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
| |
|
165
|
Chin CS, Behera S, Khalak A, Sedlazeck FJ, Sudmant PH, Wagner J, Zook JM. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 2023; 20:1213-1221. [PMID: 37365340 PMCID: PMC10406601 DOI: 10.1038/s41592-023-01914-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/10/2022] [Accepted: 05/17/2023] [Indexed: 06/28/2023]
Abstract
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
Affiliation(s)
- Chen-Shan Chin
- GeneDX, Stamford, CT, USA.
- Foundation of Biological Data Science, Belmont, CA, USA.
| | - Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Asif Khalak
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Peter H Sudmant
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| |
|
166
|
Cuomo ASE, Nathan A, Raychaudhuri S, MacArthur DG, Powell JE. Single-cell genomics meets human genetics. Nat Rev Genet 2023; 24:535-549. [PMID: 37085594 PMCID: PMC10784789 DOI: 10.1038/s41576-023-00599-5] [Citation(s) in RCA: 56] [Impact Index Per Article: 28.0] [Accepted: 03/29/2023] [Indexed: 04/23/2023]
Abstract
Single-cell genomic technologies are revealing the cellular composition, identities and states in tissues at unprecedented resolution. They have now scaled to the point that it is possible to query samples at the population level, across thousands of individuals. Combining single-cell information with genotype data at this scale provides opportunities to link genetic variation to the cellular processes underpinning key aspects of human biology and disease. This strategy has potential implications for disease diagnosis, risk prediction and development of therapeutic solutions. But, effectively integrating large-scale single-cell genomic data, genetic variation and additional phenotypic data will require advances in data generation and analysis methods. As single-cell genetics begins to emerge as a field in its own right, we review its current state and the challenges and opportunities ahead.
Affiliation(s)
- Anna S E Cuomo
- Garvan Institute of Medical Research, Darlinghurst, Sydney, New South Wales, Australia.
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia.
| | - Aparna Nathan
- Center for Data Sciences, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Divisions of Rheumatology and Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Soumya Raychaudhuri
- Center for Data Sciences, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Divisions of Rheumatology and Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Daniel G MacArthur
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia
| | - Joseph E Powell
- Garvan Institute of Medical Research, Darlinghurst, Sydney, New South Wales, Australia.
- UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, New South Wales, Australia.
| |
|
167
|
Ahsan MU, Liu Q, Perdomo JE, Fang L, Wang K. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data. Nat Methods 2023; 20:1143-1158. [PMID: 37386186 PMCID: PMC11208083 DOI: 10.1038/s41592-023-01932-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 07/16/2021] [Accepted: 05/31/2023] [Indexed: 07/01/2023]
Abstract
As long-read sequencing technologies are becoming increasingly popular, a number of methods have been developed for the discovery and analysis of structural variants (SVs) from long reads. Long reads enable detection of SVs that could not be previously detected from short-read sequencing, but computational methods must adapt to the unique challenges and opportunities presented by long-read sequencing. Here, we summarize over 50 long-read-based methods for SV detection, genotyping and visualization, and discuss how new telomere-to-telomere genome assemblies and pangenome efforts can improve the accuracy and drive the development of SV callers in the future.
Affiliation(s)
- Mian Umair Ahsan
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Jonathan Elliot Perdomo
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA
| | - Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Genetics and Biomedical Informatics, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA.
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
168
|
Soto DC, Uribe-Salazar JM, Shew CJ, Sekar A, McGinty S, Dennis MY. Genomic structural variation: A complex but important driver of human evolution. AMERICAN JOURNAL OF BIOLOGICAL ANTHROPOLOGY 2023; 181 Suppl 76:118-144. [PMID: 36794631 PMCID: PMC10329998 DOI: 10.1002/ajpa.24713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2022] [Revised: 01/21/2023] [Accepted: 02/05/2023] [Indexed: 02/17/2023]
Abstract
Structural variants (SVs), including duplications, deletions, and inversions of DNA, can have significant genomic and functional impacts but are technically difficult to identify and assay compared with single-nucleotide variants. With the aid of new genomic technologies, it has become clear that SVs account for significant differences across and within species. This phenomenon is particularly well-documented for humans and other primates due to the wealth of sequence data available. In great apes, SVs affect a larger number of nucleotides than single-nucleotide variants, with many identified SVs exhibiting population and species specificity. In this review, we highlight the importance of SVs in human evolution by discussing (1) how they have shaped great ape genomes, resulting in sensitized regions associated with traits and diseases, (2) their impact on gene functions and regulation, which subsequently has played a role in natural selection, and (3) the role of gene duplications in human brain evolution. We further discuss how to incorporate SVs in research, including the strengths and limitations of various genomic approaches. Finally, we propose future considerations in integrating existing data and biospecimens with the ever-expanding SV compendium propelled by biotechnology advancements.
Affiliation(s)
- Daniela C. Soto
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - José M. Uribe-Salazar
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Colin J. Shew
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Aarthi Sekar
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Sean McGinty
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Megan Y. Dennis
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| |
|
169
|
Chaung K, Baharav TZ, Henderson G, Zheludev IN, Wang PL, Salzman J. SPLASH: a statistical, reference-free genomic algorithm unifies biological discovery. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2022.06.24.497555. [PMID: 35794890 PMCID: PMC9258296 DOI: 10.1101/2022.06.24.497555] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/09/2022]
Abstract
Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a new unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), an approach that directly analyzes raw sequencing data to detect a signature of regulation: sample-specific sequence variation. The approach, which includes a new statistical test, is computationally efficient and can be run at scale. SPLASH unifies detection of myriad forms of sequence variation. We demonstrate that SPLASH identifies complex mutation patterns in SARS-CoV-2 strains, discovers regulated RNA isoforms at the single cell level, documents the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a new unifying approach to genomic analysis that enables an expansive scope of discovery without metadata or references.
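As a rough illustration of what detecting sample-specific sequence variation directly from raw reads can look like, the sketch below counts the k-mers that follow a fixed anchor sequence in each sample and scores how unevenly they are distributed. Everything here is an assumption made for illustration: the helper names `target_counts` and `chi_square` are hypothetical, the anchor-and-following-k-mer framing is a simplification, and the Pearson chi-square statistic is a stand-in for, not a reproduction of, the new statistical test the abstract mentions.

```python
from collections import Counter, defaultdict


def target_counts(reads_by_sample, anchor, k):
    """Count the k-mer immediately following each occurrence of `anchor`, per sample."""
    table = defaultdict(Counter)
    for sample, reads in reads_by_sample.items():
        for read in reads:
            start = read.find(anchor)
            while start != -1:
                begin = start + len(anchor)
                target = read[begin:begin + k]
                if len(target) == k:  # skip anchors too close to the read end
                    table[sample][target] += 1
                start = read.find(anchor, start + 1)
    return table


def chi_square(table):
    """Pearson chi-square statistic of a samples-by-targets contingency table."""
    samples = sorted(table)
    targets = sorted({t for s in samples for t in table[s]})
    row = {s: sum(table[s].values()) for s in samples}
    col = {t: sum(table[s][t] for s in samples) for t in targets}
    total = sum(row.values())
    stat = 0.0
    for s in samples:
        for t in targets:
            expected = row[s] * col[t] / total
            stat += (table[s][t] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    reads = {
        "sample_A": ["ACGTTT", "ACGTTT"],
        "sample_B": ["ACGTGG", "ACGTGG"],
    }
    counts = target_counts(reads, anchor="ACGT", k=2)
    print(chi_square(counts))
```

No reference genome or alignment step appears anywhere in this sketch, which is the property the abstract emphasizes. A large statistic only flags an anchor whose downstream sequence differs between samples; turning it into a calibrated p-value would need a test appropriate to the data.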
Affiliation(s)
- Kaitlin Chaung
- Department of Biomedical Data Science, Stanford University, Stanford, 94305, USA
- Department of Biochemistry, Stanford University, Stanford, 94305, USA
| | - Tavor Z. Baharav
- Department of Electrical Engineering, Stanford University, Stanford, 94305, USA
| | - George Henderson
- Department of Biomedical Data Science, Stanford University, Stanford, 94305, USA
- Department of Biochemistry, Stanford University, Stanford, 94305, USA
| | - Ivan N. Zheludev
- Department of Biochemistry, Stanford University, Stanford, 94305, USA
| | - Peter L. Wang
- Department of Biomedical Data Science, Stanford University, Stanford, 94305, USA
- Department of Biochemistry, Stanford University, Stanford, 94305, USA
| | - Julia Salzman
- Department of Biomedical Data Science, Stanford University, Stanford, 94305, USA
- Department of Biochemistry, Stanford University, Stanford, 94305, USA
- Department of Statistics (by courtesy), Stanford University, Stanford, 94305, USA
| |
|
170
|
Wang B, Dang N, Yang X, Xu S, Ye K. The human pangenome reference: the beginning of a new era for genomics. Sci Bull (Beijing) 2023; 68:1484-1487. [PMID: 37353434 DOI: 10.1016/j.scib.2023.06.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/25/2023]
Affiliation(s)
- Bo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
| | - Ningxin Dang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
| | - Xiaofei Yang
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200032, China; Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai 200032, China; Department of Liver Surgery and Transplantation, Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai 200032, China; School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China.
| | - Kai Ye
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China.
| |
|
171
|
Smith LA, Cahill JA, Graim K. Equitable machine learning counteracts ancestral bias in precision medicine, improving outcomes for all. RESEARCH SQUARE 2023:rs.3.rs-3168446. [PMID: 37546907 PMCID: PMC10402189 DOI: 10.21203/rs.3.rs-3168446/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Gold standard genomic datasets severely under-represent non-European populations, leading to inequities and a limited understanding of human disease [1-8]. Therapeutics and outcomes remain hidden because we lack insights that we could gain from analyzing ancestry-unbiased genomic data. To address this significant gap, we present PhyloFrame, the first-ever machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating big data tissue-specific functional interaction networks, global population variation data, and disease-relevant transcriptomic data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. The ability to provide accurate predictions for underrepresented groups, in particular, is substantially increased. These results demonstrate how AI can mitigate ancestral bias in training data and contribute to equitable representation in medical research.
Affiliation(s)
- Leslie A Smith
- Department of Computer & Information Science & Engineering, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
| | - James A Cahill
- Environmental Engineering Sciences Department, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
| | - Kiley Graim
- Department of Computer & Information Science & Engineering, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
|
172
|
The Impact of Genomic Variation on Function (IGVF) Consortium. ARXIV 2023:arXiv:2307.13708v1. [PMID: 37547663 PMCID: PMC10402186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Our genomes influence nearly every aspect of human biology from molecular and cellular functions to phenotypes in health and disease. Human genetics studies have now associated hundreds of thousands of differences in our DNA sequence ("genomic variation") with disease risk and other phenotypes, many of which could reveal novel mechanisms of human biology and uncover the basis of genetic predispositions to diseases, thereby guiding the development of new diagnostics and therapeutics. Yet, understanding how genomic variation alters genome function to influence phenotype has proven challenging. To unlock these insights, we need a systematic and comprehensive catalog of genome function and the molecular and cellular effects of genomic variants. Toward this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations, and predictive modeling to investigate the relationships among genomic variation, genome function, and phenotypes. Through systematic comparisons and benchmarking of experimental and computational methods, we aim to create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how both coding and noncoding variants may connect through gene regulatory and protein interaction networks. These experimental data, computational predictions, and accompanying standards and pipelines will be integrated into an open resource that will catalyze community efforts to explore genome function and the impact of genetic variation on human biology and disease across populations.
|
173
|
Yang X, Wang X, Zou Y, Zhang S, Xia M, Fu L, Vollger MR, Chen NC, Taylor DJ, Harvey WT, Logsdon GA, Meng D, Shi J, McCoy RC, Schatz MC, Li W, Eichler EE, Lu Q, Mao Y. Characterization of large-scale genomic differences in the first complete human genome. Genome Biol 2023; 24:157. [PMID: 37403156 PMCID: PMC10320979 DOI: 10.1186/s13059-023-02995-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/14/2022] [Accepted: 06/23/2023] [Indexed: 07/06/2023] Open
Abstract
BACKGROUND The first telomere-to-telomere (T2T) human genome assembly (T2T-CHM13) release is a milestone in human genomics. The T2T-CHM13 genome assembly extends our understanding of telomeres, centromeres, segmental duplication, and other complex regions. The current human genome reference (GRCh38) has been widely used in various human genomic studies. However, the large-scale genomic differences between these two important genome assemblies have not yet been characterized in detail. RESULTS Here, in addition to the previously reported "non-syntenic" regions, we find 67 additional large-scale discrepant regions and precisely categorize them into four structural types with a newly developed website tool called SynPlotter. The discrepant regions (~21.6 Mbp) excluding telomeric and centromeric regions are highly structurally polymorphic in humans, where the deletions or duplications are likely associated with various human diseases, such as immune and neurodevelopmental disorders. The analyses of a newly identified discrepant region, the KLRC gene cluster, show that the depletion of KLRC2 by a single-deletion event is associated with natural killer cell differentiation in ~20% of humans. Meanwhile, the rapid amino acid replacements observed within KLRC3 are probably a result of natural selection in primate evolution. CONCLUSION Our study provides a foundation for understanding the large-scale structural genomic differences between the two crucial human reference genomes, and is thereby important for future human genomics studies.
Affiliation(s)
- Xiangyu Yang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xuankai Wang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yawen Zou
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Manying Xia
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Lianting Fu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Dan Meng
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Junfeng Shi
- Shanghai Engineering Research Center of Advanced Dental Technology and Materials, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Weidong Li
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Qing Lu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
|
174
|
Choo ZN, Imieliński M. Faster detection of somatic structural variants. Nat Biomed Eng 2023; 7:847-848. [PMID: 37217613 DOI: 10.1038/s41551-023-01039-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Affiliation(s)
- Zi-Ning Choo
- Tri-Institutional MD-PhD Program, Weill Cornell Medicine, New York, NY, USA
- New York Genome Center, New York, NY, USA
| | - Marcin Imieliński
- New York Genome Center, New York, NY, USA.
- Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA.
|
175
|
Hamza A, El-Sissy C, Yousfi N, Martins PV, Rafat C, Masliah-Planchon J, Frémeaux-Bacchi V, Mesnard L. The absence of CFHR3 and CFHR1 genes from the T2T-CHM13 assembly can limit the molecular diagnosis of complement-related diseases. Eur J Hum Genet 2023; 31:730-732. [PMID: 37032353 PMCID: PMC10325998 DOI: 10.1038/s41431-023-01350-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/27/2022] [Revised: 03/14/2023] [Accepted: 03/20/2023] [Indexed: 04/11/2023] Open
Affiliation(s)
- Abderaouf Hamza
- Department of Genetics, Institut Curie, PSL Research University, Paris, France
| | - Carine El-Sissy
- Department of Biological Immunology, Hôpital Européen Georges Pompidou, Assistance Publique-Hôpitaux de Paris, Paris, France
| | - Nadhir Yousfi
- Unité Mixte de Recherche S1155, Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, France
| | - Paula Vieira Martins
- Department of Biological Immunology, Hôpital Européen Georges Pompidou, Assistance Publique-Hôpitaux de Paris, Paris, France
| | - Cédric Rafat
- Service de Soins Intensifs Néphrologiques et Rein Aigu (SINRA), French Intensive Renal Network, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Paris, France
- Faculté de Médecine, Sorbonne Université, Paris, France
| | | | - Véronique Frémeaux-Bacchi
- Department of Biological Immunology, Hôpital Européen Georges Pompidou, Assistance Publique-Hôpitaux de Paris, Paris, France
- Unité Mixte de Recherche S1138, Institut National de la Santé et de la Recherche Médicale (INSERM), Centre de Recherche des Cordeliers, Paris, France
| | - Laurent Mesnard
- Unité Mixte de Recherche S1155, Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, France.
- Service de Soins Intensifs Néphrologiques et Rein Aigu (SINRA), French Intensive Renal Network, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Paris, France.
- Faculté de Médecine, Sorbonne Université, Paris, France.
- Institut des Sciences du Calcul et des Données, Sorbonne Université, Paris, France.
|
176
|
Laufer VA, Glover TW, Wilson TE. Applications of advanced technologies for detecting genomic structural variation. MUTATION RESEARCH. REVIEWS IN MUTATION RESEARCH 2023; 792:108475. [PMID: 37931775 PMCID: PMC10792551 DOI: 10.1016/j.mrrev.2023.108475] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/26/2023] [Revised: 09/07/2023] [Accepted: 11/02/2023] [Indexed: 11/08/2023]
Abstract
Chromosomal structural variation (SV) encompasses a heterogenous class of genetic variants that exerts strong influences on human health and disease. Despite their importance, many structural variants (SVs) have remained poorly characterized at even a basic level, a discrepancy predicated upon the technical limitations of prior genomic assays. However, recent advances in genomic technology can identify and localize SVs accurately, opening new questions regarding SV risk factors and their impacts in humans. Here, we first define and classify human SVs and their generative mechanisms, highlighting characteristics leveraged by various SV assays. We next examine the first-ever gapless assembly of the human genome and the technical process of assembling it, which required third-generation sequencing technologies to resolve structurally complex loci. The new portions of that "telomere-to-telomere" and subsequent pangenome assemblies highlight aspects of SV biology likely to develop in the near-term. We consider the strengths and limitations of the most promising new SV technologies and when they or longstanding approaches are best suited to meeting salient goals in the study of human SV in population-scale genomics research, clinical, and public health contexts. It is a watershed time in our understanding of human SV when new approaches are expected to fundamentally change genomic applications.
Affiliation(s)
- Vincent A Laufer
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
| | - Thomas W Glover
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
| | - Thomas E Wilson
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
|
177
|
Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y, Lei C, Wang Y, Pan Y, Ma S, Sun H, Zhao X, Shi Y, Yang Z, Wu D, Wu S, Zhao X, Shi B, Jin L, Hu Z, Lu Y, Chu J, Ye K, Xu S. A pangenome reference of 36 Chinese populations. Nature 2023; 619:112-121. [PMID: 37316654 PMCID: PMC10322713 DOI: 10.1038/s41586-023-06173-7] [Citation(s) in RCA: 61] [Impact Index Per Article: 30.5] [Received: 09/23/2022] [Accepted: 05/05/2023] [Indexed: 06/16/2023]
Abstract
Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference [1]. The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.
Affiliation(s)
- Yang Gao
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xiaofei Yang
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
| | - Hao Chen
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Xinjiang Tan
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Zhaoqing Yang
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China
| | - Lian Deng
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Baonan Wang
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Shuang Kong
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Songyang Li
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Yuhang Cui
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Chang Lei
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Yimin Wang
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yuwen Pan
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Sen Ma
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Hao Sun
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China
| | - Xiaohan Zhao
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Yingbing Shi
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Ziyi Yang
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Dongdong Wu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shaoyuan Wu
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, China
| | - Xingming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Ministry of Education Key (MOE) Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, MOE Frontiers Center for Brain Science Fudan University, Shanghai, China
| | - Binyin Shi
- Department of Endocrinology, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
| | - Li Jin
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Zhibin Hu
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
| | - Yan Lu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China.
| | - Jiayou Chu
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China.
| | - Kai Ye
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China.
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China.
- School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China.
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China.
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China.
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China.
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, China.
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China.
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China.
|
178
|
Prodanov T, Bansal V. A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing. Bioinformatics 2023; 39:i279-i287. [PMID: 37387146 PMCID: PMC10311303 DOI: 10.1093/bioinformatics/btad268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover > 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. METHODS We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. RESULTS On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. AVAILABILITY AND IMPLEMENTATION ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.
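The abstract above reports precision and recall per caller and summarizes accuracy as F1. As a minimal sketch, assuming F1 here is the usual harmonic mean of precision and recall (the abstract does not spell out the formula), the HG002 figures quoted above can be combined as follows. The function name `f1_score` and the dictionary layout are illustrative choices, not part of ParascopyVC.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) on HG002 LCR regions, copied from the abstract above.
hg002 = {
    "ParascopyVC": (0.991, 0.909),
    "FreeBayes": (0.954, 0.822),
    "GATK": (0.888, 0.873),
    "DeepVariant": (0.983, 0.861),
}

f1_by_caller = {name: f1_score(p, r) for name, (p, r) in hg002.items()}

# Print callers from highest to lowest F1.
for name, f1 in sorted(f1_by_caller.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {f1:.3f}")
```

These values cover HG002 only, so they are not expected to match the seven-genome mean F1 quoted in the abstract.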
Affiliation(s)
- Timofey Prodanov
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA 92093, United States
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Vikas Bansal
- School of Medicine, University of California San Diego, La Jolla, CA 92093, United States
|
179
|
Smith TPL, Bickhart DM, Boichard D, Chamberlain AJ, Djikeng A, Jiang Y, Low WY, Pausch H, Demyda-Peyrás S, Prendergast J, Schnabel RD, Rosen BD. The Bovine Pangenome Consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. Genome Biol 2023; 24:139. [PMID: 37337218 DOI: 10.1186/s13059-023-02975-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 07/19/2022] [Accepted: 05/19/2023] [Indexed: 06/21/2023] Open
Abstract
The Bovine Pangenome Consortium (BPC) is an international collaboration dedicated to the assembly of cattle genomes to develop a more complete representation of cattle genomic diversity. The goal of the BPC is to provide genome assemblies and a community-agreed pangenome representation to replace breed-specific reference assemblies for cattle genomics. The BPC invites partners sharing our vision to participate in the production of these assemblies and the development of a common, community-approved, pangenome reference as a public resource for the research community ( https://bovinepangenome.github.io/ ). This community-driven resource will provide the context for comparison between studies and the future foundation for cattle genomic selection.
Affiliation(s)
- Timothy P L Smith
- US Meat Animal Research Center, USDA-ARS, Clay Center, NE, 68933, USA
| | | | - Didier Boichard
- Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France
| | - Amanda J Chamberlain
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Appolinaire Djikeng
- Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
| | - Yu Jiang
- Center for Ruminant Genetics and Evolution, Northwest A&F University, Yangling, 712100, China
| | - Wai Y Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, 5371, Australia
| | - Hubert Pausch
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Sebastian Demyda-Peyrás
- Departamento de Producción Animal, Facultad de Ciencias Veterinarias, Universidad Nacional de La Plata, 1900, La Plata, Argentina
- Consejo Superior de Investigaciones Científicas Y Tecnológicas (CONICET), CCT-La Plata, 1900, La Plata, Argentina
| | - James Prendergast
- Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
| | - Robert D Schnabel
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, 20705, USA.
|
180
|
Abondio P, Cilli E, Luiselli D. Human Pangenomics: Promises and Challenges of a Distributed Genomic Reference. Life (Basel) 2023; 13:1360. [PMID: 37374141 DOI: 10.3390/life13061360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 06/02/2023] [Accepted: 06/08/2023] [Indexed: 06/29/2023] Open
Abstract
A pangenome is a collection of the common and unique genomes that are present in a given species. It combines the genetic information of all the genomes sampled, resulting in a large and diverse range of genetic material. Pangenomic analysis offers several advantages compared to traditional genomic research. For example, a pangenome is not bound by the physical constraints of a single genome, so it can capture more genetic variability. Thanks to the introduction of the concept of pangenome, it is possible to use exceedingly detailed sequence data to study the evolutionary history of two different species, or how populations within a species differ genetically. In the wake of the Human Pangenome Project, this review aims at discussing the advantages of the pangenome around human genetic variation, which are then framed around how pangenomic data can inform population genetics, phylogenetics, and public health policy by providing insights into the genetic basis of diseases or determining personalized treatments, targeting the specific genetic profile of an individual. Moreover, technical limitations, ethical concerns, and legal considerations are discussed.
Affiliation(s)
- Paolo Abondio
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
| | - Elisabetta Cilli
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
| | - Donata Luiselli
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
|
181
|
Li H, Zhu X, Yang Y, Wang W, Mao A, Li J, Bao S, Li J. Long-read sequencing: An effective method for genetic analysis of CYP21A2 variation in congenital adrenal hyperplasia. Clin Chim Acta 2023:117419. [PMID: 37276943 DOI: 10.1016/j.cca.2023.117419] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 01/18/2023] [Revised: 05/09/2023] [Accepted: 06/02/2023] [Indexed: 06/07/2023]
Abstract
BACKGROUND The sequence similarity between the CYP21A2 gene and its inactive pseudogene CYP21A1P, and copy number variation (CNV) caused by unequal crossover, make it challenging to characterize the CYP21A2 gene through traditional methods. This study aimed to evaluate the clinical utility of the long-read sequencing (LRS) method in carrier screening and genetic diagnosis of congenital adrenal hyperplasia (CAH) by comparing the efficiency of the LRS method with the conventional multiplex ligation-dependent probe amplification (MLPA) plus Sanger sequencing approaches in CYP21A2 analysis. METHODS In a retrospective study, full sequence analysis of CYP21A2 and CYP21A1P was performed for three pedigrees through long-range locus-specific PCR followed by LRS based on the Pacific Biosciences (PacBio, California, USA) single-molecule real-time (SMRT) platform, and the results were compared with those obtained from next-generation sequencing (NGS)-based whole exome sequencing (WES) and the traditional methods of MLPA plus Sanger sequencing. RESULTS The LRS method successfully identified seven CYP21A2 variants, including three single nucleotide variants (NM_000500.9:c.1451G>C p.(Arg484Pro), c.293-13A/C>G (IVS2-13A/C>G), c.518T>A p.(Ile173Asn)), one 111-bp polynucleotide insertion, one set of 3'UTR variants (NM_000500.9:c.*368T>C, c.*390A>G, c.*440C>T, c.*443T>C), and two types of chimeric genes, and straightforwardly depicted the inheritance patterns of these variants within families. Moreover, the LRS method enabled us to determine the cis-trans configuration of multiple variants in one assay, without the need to analyze additional family samples. Compared with traditional methods, this LRS method can achieve a precise, comprehensive and intuitive result in the genetic diagnosis of 21-hydroxylase deficiency (21-OHD). CONCLUSION The LRS method is comprehensive in CYP21A2 analysis and intuitive in result presentation, which holds substantial promise in clinical application as a crucial tool for carrier screening and genetic diagnosis of CAH.
Affiliation(s)
- Huijun Li
- Department of Obstetrics and Gynecology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
- Xiangyu Zhu
- Department of Obstetrics and Gynecology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
- Ying Yang
- Department of Obstetrics and Gynecology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
- Wanjun Wang
- Department of Obstetrics and Gynecology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
- Aiping Mao
- Berry Genomics Corporation, Beijing, 102200, China
- Jiaqi Li
- Berry Genomics Corporation, Beijing, 102200, China
- Shilai Bao
- State Key Laboratory of Molecular Developmental Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China
- Jie Li
- Department of Obstetrics and Gynecology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
182
George SS, Pimkin M, Paralkar VR. Construction and validation of customized genomes for human and mouse ribosomal DNA mapping. J Biol Chem 2023; 299:104766. [PMID: 37121547 PMCID: PMC10245113 DOI: 10.1016/j.jbc.2023.104766] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 02/26/2023] [Revised: 04/19/2023] [Accepted: 04/21/2023] [Indexed: 05/02/2023] Open
Abstract
rRNAs are transcribed from ribosomal DNA (rDNA) repeats, the most intensively transcribed loci in the genome. Due to their repetitive nature, there is a lack of genome assemblies suitable for rDNA mapping, creating a vacuum in our understanding of how the most abundant RNA in the cell is regulated. Our recent work revealed binding of numerous mammalian transcription and chromatin factors to rDNA. Several of these factors were known to play critical roles in development, tissue function, and malignancy, but their potential roles in rDNA regulation remained unexplored. This demonstrated the blind spot into which rDNA has fallen in genetic and epigenetic studies and highlighted an unmet need for public rDNA-optimized genome assemblies. Here, we customized five human and mouse assemblies, namely hg19 (GRCh37), hg38 (GRCh38), hs1 (T2T-CHM13), mm10 (GRCm38), and mm39 (GRCm39), to render them suitable for rDNA mapping. The standard builds of these genomes contain numerous fragmented or repetitive rDNA loci. We identified and masked all rDNA-like regions, added a single rDNA reference sequence of the appropriate species as a ∼45 kb chromosome designated "chromosome R," and created annotation files to aid visualization of rDNA features in browser tracks. We validated these customized genomes for mapping of known rDNA-binding proteins and present a simple workflow for mapping chromatin immunoprecipitation-sequencing datasets. Customized genome assemblies, annotation files, positive and negative control tracks, and Snapgene files of standard rDNA reference sequences have been deposited to GitHub. These resources make rDNA mapping and visualization more readily accessible to a broad audience.
Affiliation(s)
- Subin S George
- Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Maxim Pimkin
- Cancer and Blood Disorders Center, Harvard Medical School, Dana-Farber Cancer Institute and Boston Children's Hospital, Boston, Massachusetts, USA
- Vikram R Paralkar
- Division of Hematology and Oncology, Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA; Department of Cell and Developmental Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA; Abramson Family Cancer Research Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
183
Song L, Bai G, Liu XS, Li B, Li H. Efficient and accurate KIR and HLA genotyping with massively parallel sequencing data. Genome Res 2023; 33:923-931. [PMID: 37169596 PMCID: PMC10519407 DOI: 10.1101/gr.277585.122] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/11/2022] [Accepted: 05/04/2023] [Indexed: 05/13/2023]
Abstract
Killer cell immunoglobulin like receptor (KIR) genes and human leukocyte antigen (HLA) genes play important roles in innate and adaptive immunity. They are highly polymorphic and cannot be genotyped with standard variant calling pipelines. Compared with HLA genes, many KIR genes are similar to each other in sequences and may be absent in the chromosomes. Therefore, although many tools have been developed to genotype HLA genes using common sequencing data, none of them work for KIR genes. Even specialized KIR genotypers could not resolve all the KIR genes. Here we describe T1K, a novel computational method for the efficient and accurate inference of KIR or HLA alleles from RNA-seq, whole-genome sequencing, or whole-exome sequencing data. T1K jointly considers alleles across all genotyped genes, so it can reliably identify present genes and distinguish homologous genes, including the challenging KIR2DL5A/KIR2DL5B genes. This model also benefits HLA genotyping, where T1K achieves high accuracy in benchmarks. Moreover, T1K can call novel single-nucleotide variants and process single-cell data. Applying T1K to tumor single-cell RNA-seq data, we found that KIR2DL4 expression was enriched in tumor-specific CD8+ T cells. T1K may open the opportunity for HLA and KIR genotyping across various sequencing applications.
Affiliation(s)
- Li Song
- Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
- Gali Bai
- Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- X Shirley Liu
- Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Bo Li
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
184
Leonard AS, Crysnanto D, Mapel XM, Bhati M, Pausch H. Graph construction method impacts variation representation and analyses in a bovine super-pangenome. Genome Biol 2023; 24:124. [PMID: 37217946 PMCID: PMC10204317 DOI: 10.1186/s13059-023-02969-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 09/17/2022] [Accepted: 05/10/2023] [Indexed: 05/24/2023] Open
Abstract
BACKGROUND Several models and algorithms have been proposed to build pangenomes from multiple input assemblies, but their impact on variant representation, and consequently downstream analyses, is largely unknown. RESULTS We create multi-species super-pangenomes using pggb, cactus, and minigraph with the Bos taurus taurus reference sequence and eleven haplotype-resolved assemblies from taurine and indicine cattle, bison, yak, and gaur. We recover 221 k nonredundant structural variations (SVs) from the pangenomes, of which 135 k (61%) are common to all three. SVs derived from assembly-based calling show high agreement with the consensus calls from the pangenomes (96%), but validate only a small proportion of variations private to each graph. Pggb and cactus, which also incorporate base-level variation, have approximately 95% exact matches with assembly-derived small variant calls, which significantly improves the edit rate when realigning assemblies compared to minigraph. We use the three pangenomes to investigate 9566 variable number tandem repeats (VNTRs), finding 63% have identical predicted repeat counts in the three graphs, while minigraph can over or underestimate the count given its approximate coordinate system. We examine a highly variable VNTR locus and show that repeat unit copy number impacts the expression of proximal genes and non-coding RNA. CONCLUSIONS Our findings indicate good consensus between the three pangenome methods but also show their individual strengths and weaknesses that need to be considered when analysing different types of variants from multiple input assemblies.
Affiliation(s)
- Alexander S Leonard
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
- Danang Crysnanto
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
- Xena M Mapel
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
- Meenu Bhati
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
- Hubert Pausch
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
185
Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. bioRxiv 2023:2023.05.16.540882. [PMID: 37325780 PMCID: PMC10268037 DOI: 10.1101/2023.05.16.540882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
MOTIVATION The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
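For readers unfamiliar with the quantities named in this abstract, the following is a minimal Python sketch of exact Jaccard similarity on k-mer sets and of plain minimizer winnowing. It is not MashMap's implementation and does not reproduce the minmer scheme itself; the function names (`kmers`, `jaccard`, `minimizers`), the choice of BLAKE2 as the hash, and the parameters `k` and `w` are assumptions made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Plain minimizer winnowing: from every window of w consecutive
    k-mers, keep the k-mer with the smallest hash value."""
    ordered = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(ordered) - w + 1):
        selected.add(min(ordered[start:start + w], key=kmer_hash))
    return selected


if __name__ == "__main__":
    s1, s2 = "ACGTACGTTGCA", "ACGTACGATGCA"
    print("exact k-mer Jaccard:", jaccard(kmers(s1, 4), kmers(s2, 4)))
    print("minimizer-set Jaccard:",
          jaccard(minimizers(s1, 4, 3), minimizers(s2, 4, 3)))
```

Comparing the two printed values on longer sequences gives a feel for how a winnowed sketch can deviate from the exact k-mer Jaccard, which is the kind of discrepancy the abstract describes.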
Affiliation(s)
- Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
- Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
186
Groza C, Chen X, Pacis A, Simon MM, Pramatarova A, Aracena KA, Pastinen T, Barreiro LB, Bourque G. Genome graphs detect human polymorphisms in active epigenomic state during influenza infection. Cell Genomics 2023; 3:100294. [PMID: 37228750 PMCID: PMC10203048 DOI: 10.1016/j.xgen.2023.100294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 07/26/2022] [Accepted: 03/09/2023] [Indexed: 05/27/2023]
Abstract
Genetic variants, including mobile element insertions (MEIs), are known to impact the epigenome. We hypothesized that genome graphs, which encapsulate genetic diversity, could reveal missing epigenomic signals. To test this, we sequenced the epigenome of monocyte-derived macrophages from 35 ancestrally diverse individuals before and after influenza infection, allowing us to investigate the role of MEIs in immunity. We characterized genetic variants and MEIs using linked reads and built a genome graph. Mapping epigenetic data revealed 2.3%-3% novel peaks for H3K4me1, H3K27ac chromatin immunoprecipitation sequencing (ChIP-seq), and ATAC-seq. Additionally, the use of a genome graph modified some quantitative trait loci estimates and revealed 375 polymorphic MEIs in an active epigenomic state. Among these is an AluYh3 polymorphism whose chromatin state changed after infection and was associated with the expression of TRIM25, a gene that restricts influenza RNA synthesis. Our results demonstrate that graph genomes can reveal regulatory regions that would have been overlooked by other approaches.
Affiliation(s)
- Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, QC, Canada
- Xun Chen
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Alain Pacis
- Canadian Centre for Computational Genomics, McGill University, Montréal, QC, Canada
- Marie-Michelle Simon
- Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Albena Pramatarova
- Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Tomi Pastinen
- Genomic Medicine Center, Children’s Mercy Hospital and Research Institute, Kansas City, MO, USA
- Luis B. Barreiro
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Committee on Immunology, University of Chicago, Chicago, IL, USA
- Guillaume Bourque
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Canadian Centre for Computational Genomics, McGill University, Montréal, QC, Canada
- Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Human Genetics, McGill University, Montréal, QC, Canada
187
Yilmaz F, Gurusamy U, Mosley TJ, Hallast P, Kim K, Mostovoy Y, Purcell RH, Shaikh TH, Zwick ME, Kwok PY, Lee C, Mulle JG. High level of complexity and global diversity of the 3q29 locus revealed by optical mapping and long-read sequencing. Genome Med 2023; 15:35. [PMID: 37165454 PMCID: PMC10170684 DOI: 10.1186/s13073-023-01184-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/17/2022] [Accepted: 04/20/2023] [Indexed: 05/12/2023] Open
Abstract
BACKGROUND High sequence identity between segmental duplications (SDs) can facilitate copy number variants (CNVs) via non-allelic homologous recombination (NAHR). These CNVs are one of the fundamental causes of genomic disorders such as the 3q29 deletion syndrome (del3q29S). There are 21 protein-coding genes lost or gained as a result of such recurrent 1.6-Mbp deletions or duplications, respectively, in the 3q29 locus. While NAHR plays a role in CNV occurrence, the factors that increase the risk of NAHR at this particular locus are not well understood. METHODS We employed an optical genome mapping technique to characterize the 3q29 locus in 161 unaffected individuals, 16 probands with del3q29S and their parents, and 2 probands with the 3q29 duplication syndrome (dup3q29S). Long-read sequencing-based haplotype-resolved de novo assemblies from 44 unaffected individuals and 1 trio were used for orthogonal validation of haplotypes and deletion breakpoints. RESULTS In total, we discovered 34 haplotypes, of which 19 were novel. Among these 19 novel haplotypes, 18 were detected in unaffected individuals, while 1 was detected on the parent-of-origin chromosome of a proband with del3q29S. Phased assemblies from 44 unaffected individuals enabled the orthogonal validation of 20 haplotypes. In 89% (16/18) of the probands, breakpoints were confined to paralogous copies of a 20-kbp segment within the 3q29 SDs. In one del3q29S proband, the breakpoint was confined to a 374-bp region using long-read sequencing. Furthermore, we categorized del3q29S cases into three classes and dup3q29S cases into two classes based on breakpoints. Finally, we found no evidence of inversions in parent-of-origin chromosomes. CONCLUSIONS We have generated the most comprehensive haplotype map for the 3q29 locus using unaffected individuals, probands with del3q29S or dup3q29S, and available parents, and also determined the deletion breakpoint to be within a 374-bp region in one proband with del3q29S. These results should provide a better understanding of the underlying genetic architecture that contributes to the etiology of del3q29S and dup3q29S.
Affiliation(s)
- Feyza Yilmaz
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Umamaheswaran Gurusamy
- Cardiovascular Research Institute and Institute for Human Genetics, UCSF School of Medicine, 513 Parnassus Ave, San Francisco, CA, 94143, USA
- Trenell J Mosley
- Graduate Program in Genetics and Molecular Biology, Laney Graduate School, Emory University, 201 Dowman Drive, Atlanta, GA, 30322, USA
- Pille Hallast
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Kwondo Kim
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Yulia Mostovoy
- Cardiovascular Research Institute and Institute for Human Genetics, UCSF School of Medicine, 513 Parnassus Ave, San Francisco, CA, 94143, USA
- Ryan H Purcell
- Laboratory of Translational Cell Biology, Department of Cell Biology, Emory University School of Medicine, 100 Woodruff Circle, Atlanta, GA, 30322, USA
- Tamim H Shaikh
- Department of Pediatrics, Section of Genetics and Metabolism, University of Colorado School of Medicine, 13123 E 16th Ave, Aurora, CO, 80045, USA
- Michael E Zwick
- Department of Genetics, Rutgers University-New Brunswick, Rutgers University, Piscataway, New Brunswick, NJ, 08901, USA
- Pui-Yan Kwok
- Cardiovascular Research Institute and Institute for Human Genetics, UCSF School of Medicine, 513 Parnassus Ave, San Francisco, CA, 94143, USA
- Department of Dermatology, UCSF School of Medicine, 1701 Divisadero Street, San Francisco, CA, 94115, USA
- Charles Lee
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Jennifer G Mulle
- Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, Rutgers University, 671 Hoes Lane, New Brunswick, NJ, 08901, USA
188
Weisburd B, Tiao G, Rehm HL. Insights from a genome-wide truth set of tandem repeat variation. bioRxiv 2023:2023.05.05.539588. [PMID: 37214979 PMCID: PMC10197592 DOI: 10.1101/2023.05.05.539588] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 05/24/2023]
Abstract
Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcity of high-quality orthogonal truth data limited our ability to measure tool accuracy for the millions of other loci throughout the genome. To address this, we developed a TR truth set based on the Synthetic Diploid Benchmark (SynDip). By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample. Our approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation. We applied this truth set to compare the strengths and weaknesses of widely-used tools for genotyping TRs, evaluated the completeness of existing genome-wide TR catalogs, and explored the properties of tandem repeat variation throughout the genome. We found that, without filtering, ExpansionHunter had higher accuracy than GangSTR and HipSTR over a wide range of motifs and allele sizes. Also, when errors in allele size occurred, ExpansionHunter tended to overestimate expansion sizes, while GangSTR tended to underestimate them. Additionally, we saw that widely-used TR catalogs miss between 16% and 41% of variant loci in the truth set. These results suggest that genome-wide analyses would benefit from genotyping a larger set of loci as well as further tool development that builds on the strengths of current algorithms. To that end, we developed a new catalog of 2.8 million loci that captures 95% of variant loci in the truth set, and created a modified version of ExpansionHunter that runs 2 to 3x faster than the original while producing the same output.
Affiliation(s)
- Ben Weisburd
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Grace Tiao
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Heidi L. Rehm
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
189
Harvey WT, Ebert P, Ebler J, Audano PA, Munson KM, Hoekzema K, Porubsky D, Beck CR, Marschall T, Garimella K, Eichler EE. Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall. bioRxiv 2023:2023.05.04.539448. [PMID: 37205567 PMCID: PMC10187267 DOI: 10.1101/2023.05.04.539448] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 05/21/2023]
Abstract
Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy, and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets, with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
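The abstract reports precision, recall, and F1 score without defining them. As a reminder of the standard definitions, here is a small Python sketch; the function name `precision_recall_f1` and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts of true-positive,
    false-positive, and false-negative variant calls."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical callset: 50 correct calls, 50 spurious calls, 50 missed variants.
    print(precision_recall_f1(50, 50, 50))
```

Under these definitions, an F1 score above 0.5 can arise from quite different mixes of precision and recall, so the two components are worth inspecting separately when comparing coverages.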
Affiliation(s)
- William T. Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Peter A. Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Katherine M. Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Christine R. Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT 06032 USA
- Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Kiran Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Evan E. Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
190
Zhuo X, Hsu S, Purushotham D, Kuntala PK, Harrison JK, Du AY, Chen S, Li D, Wang T. Comparing genomic and epigenomic features across species using the WashU Comparative Epigenome Browser. Genome Res 2023; 33:824-835. [PMID: 37156621 PMCID: PMC10317122 DOI: 10.1101/gr.277550.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Accepted: 05/03/2023] [Indexed: 05/10/2023]
Abstract
Genome browsers have become an intuitive and critical tool to visualize and analyze genomic features and data. Conventional genome browsers display data/annotations on a single reference genome/assembly; there are also genomic alignment viewer/browsers that help users visualize alignment, mismatch, and rearrangement between syntenic regions. However, there is a growing need for a comparative epigenome browser that can display genomic and epigenomic data sets across different species and enable users to compare them between syntenic regions. Here, we present the WashU Comparative Epigenome Browser. It allows users to load functional genomic data sets/annotations mapped to different genomes and display them over syntenic regions simultaneously. The browser also displays genetic differences between the genomes from single-nucleotide variants (SNVs) to structural variants (SVs) to visualize the association between epigenomic differences and genetic differences. Instead of anchoring all data sets to the reference genome coordinates, it creates independent coordinates of different genome assemblies to faithfully present features and data mapped to different genomes. It uses a simple, intuitive genome-align track to illustrate the syntenic relationship between different species. It extends the widely used WashU Epigenome Browser infrastructure and can be expanded to support multiple species. This new browser function will greatly facilitate comparative genomic/epigenomic research, as well as support the recent growing needs to directly compare and benchmark the T2T CHM13 assembly and other human genome assemblies.
Affiliation(s)
- Xiaoyu Zhuo
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Silas Hsu
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Deepak Purushotham
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Prashant Kumar Kuntala
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Jessica K Harrison
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Alan Y Du
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Samuel Chen
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Daofeng Li
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Ting Wang
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63110, USA
191
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 467] [Impact Index Per Article: 233.5] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
| | - Mobin Asri
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Daniel Doerr
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Marina Haukness
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Shuangjia Lu
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - Julian K Lucas
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jean Monlong
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haley J Abel
- Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
| | - Silvia Buonaiuto
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | - Xian H Chang
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Justin Chu
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Christian Fischer
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Robert S Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Shilpa Garg
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
| | - Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
| | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Charles Markello
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Hugh E Olsen
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Trevor Pesout
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jonas A Sibbesen
- Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Chad Tomlinson
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Carl A Baker
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | | | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Robert M Cook-Deegan
- Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
| | - Omar E Cornejo
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Mark Diekhans
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam L Felsenfeld
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Nanibaa' A Garrison
- Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
- Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
| | | | - Jan O Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Hugo Magalhães
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Alice B Popejoy
- Department of Public Health Sciences, University of California, Davis, CA, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I Schultz
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Michael W Smith
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J Sofia
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
- Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - David Haussler
- Genomics Institute, University of California, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Karen H Miga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
| | - Ira M Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA.
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, CA, USA.
|
192
|
Porubsky D, Harvey WT, Rozanski AN, Ebler J, Höps W, Ashraf H, Hasenfeld P, Paten B, Sanders AD, Marschall T, Korbel JO, Eichler EE. Inversion polymorphism in a complete human genome assembly. Genome Biol 2023; 24:100. [PMID: 37122002 PMCID: PMC10150506 DOI: 10.1186/s13059-023-02919-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 10/05/2022] [Accepted: 03/31/2023] [Indexed: 05/02/2023] Open
Abstract
The telomere-to-telomere (T2T) complete human reference has significantly improved our ability to characterize genome structural variation. To understand its impact on inversion polymorphisms, we remapped data from 41 genomes against the T2T reference genome and compared it to the GRCh38 reference. We find a ~21% increase in sensitivity, improving mapping of 63 inversions on the T2T reference. We identify 26 misorientations within GRCh38 and show that the T2T reference is three times more likely to represent the correct orientation of the major human allele. Analysis of 10 additional samples reveals novel rare inversions at chromosomes 15q25.2, 16p11.2, 16q22.1-23.1, and 22q11.21.
Affiliation(s)
- David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Allison N Rozanski
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Moorenstraße 5, 40225, Düsseldorf, Germany
| | - Wolfram Höps
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
| | - Hufsah Ashraf
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Moorenstraße 5, 40225, Düsseldorf, Germany
| | - Patrick Hasenfeld
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, 95064, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Helmholtz Association, 10115, Berlin, Germany
- Berlin Institute of Health (BIH), 10178, Berlin, Germany
- Charité-Universitätsmedizin, 10117, Berlin, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Moorenstraße 5, 40225, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Moorenstraße 5, 40225, Düsseldorf, Germany
| | - Jan O Korbel
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, 98195, USA.
|
193
|
Parejo M, Talenti A, Richardson M, Vignal A, Barnett M, Wragg D. AmelHap: Leveraging drone whole-genome sequence data to create a honey bee HapMap. Sci Data 2023; 10:198. [PMID: 37037860 PMCID: PMC10086014 DOI: 10.1038/s41597-023-02097-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/27/2022] [Accepted: 03/22/2023] [Indexed: 04/12/2023] Open
Abstract
Honey bee, Apis mellifera, drones are typically haploid, developing from an unfertilized egg, inheriting only their queen's alleles and none from the many drones she mated with. Thus the ordered combination or 'phase' of alleles is known, making drones a valuable haplotype resource. We collated whole-genome sequence data for 1,407 drones, including 45 newly sequenced Scottish drones, collectively representing 19 countries, 8 subspecies and various hybrids. Following alignment to Amel_HAv3.1, variant calling and quality filtering, we retained 17.4 M high quality variants across 1,328 samples with a genotyping rate of 98.7%. We demonstrate the utility of this haplotype resource, AmelHap, for genotype imputation, returning >95% concordance when up to 61% of data is missing in haploids and up to 12% of data is missing in diploids. AmelHap will serve as a useful resource for the community for imputation from low-depth sequencing or SNP chip data, accurate phasing of diploids for association studies, and as a comprehensive reference panel for population genetic and evolutionary analyses.
Affiliation(s)
- M Parejo
- Applied Genomics and Bioinformatics, University of the Basque Country (UPV/EHU), Leioa, Spain
| | - A Talenti
- The Roslin Institute, University of Edinburgh, Easter Bush Campus, Midlothian, UK
| | - M Richardson
- University of Edinburgh, King's Buildings Campus, Edinburgh, UK
- Beebytes Analytics CIC, Roslin Innovation Centre, Easter Bush Campus, Midlothian, UK
| | - A Vignal
- GenPhySE, Université de Toulouse, INRAE, INPT, INP-ENVT, 31326, Castanet Tolosan, France
| | - M Barnett
- Beebytes Analytics CIC, Roslin Innovation Centre, Easter Bush Campus, Midlothian, UK
| | - D Wragg
- The Roslin Institute, University of Edinburgh, Easter Bush Campus, Midlothian, UK.
- Beebytes Analytics CIC, Roslin Innovation Centre, Easter Bush Campus, Midlothian, UK.
|
194
|
Porubsky D, Vollger MR, Harvey WT, Rozanski AN, Ebert P, Hickey G, Hasenfeld P, Sanders AD, Stober C, Korbel JO, Paten B, Marschall T, Eichler EE. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res 2023; 33:496-510. [PMID: 37164484 PMCID: PMC10234299 DOI: 10.1101/gr.277334.122] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 09/19/2022] [Accepted: 12/07/2022] [Indexed: 05/12/2023]
Abstract
There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
Affiliation(s)
- David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Allison N Rozanski
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Glenn Hickey
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Patrick Hasenfeld
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany
- Berlin Institute of Health (BIH), 10178 Berlin, Germany
- Charité-Universitätsmedizin, 10117 Berlin, Germany
| | - Catherine Stober
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany
| | - Jan O Korbel
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
|
195
|
Sikic M. Facilitating genome structural variation analysis. Nat Methods 2023; 20:491-492. [PMID: 36959321 DOI: 10.1038/s41592-023-01767-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2023]
Affiliation(s)
- Mile Sikic
- Laboratory of AI in Genomics, Genome Institute of Singapore, A*STAR, Singapore, Singapore.
- Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia.
|
196
|
Lee H, Kim J, Lee J. Benchmarking datasets for assembly-based variant calling using high-fidelity long reads. BMC Genomics 2023; 24:148. [PMID: 36973656 PMCID: PMC10045170 DOI: 10.1186/s12864-023-09255-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/05/2022] [Accepted: 03/17/2023] [Indexed: 03/29/2023] Open
Abstract
BACKGROUND: Recent advances in long-read sequencing technologies have enabled accurate identification of all genetic variants in individuals or cells; this procedure is known as variant calling. However, benchmarking studies on variant calling using different long-read sequencing technologies are still lacking. RESULTS: We used two Caenorhabditis elegans strains to measure several variant calling metrics. These two strains shared true-positive genetic variants that were introduced during strain generation. In addition, both strains contained common and distinguishable variants induced by DNA damage, possibly leading to false-positive estimation. We obtained accurate and noisy long reads from both strains using high-fidelity (HiFi) and continuous long-read (CLR) sequencing platforms, and compared the variant calling performance of the two platforms. HiFi identified a 1.65-fold higher number of true-positive variants on average, with 60% fewer false-positive variants, than CLR did. We also compared read-based and assembly-based variant calling methods in combination with subsampling of various sequencing depths and demonstrated that variant calling after genome assembly was particularly effective for detection of large insertions, even with 10× sequencing depth of accurate long-read sequencing data. CONCLUSIONS: By directly comparing the two long-read sequencing technologies, we demonstrated that variant calling after genome assembly with 10× or more depth of accurate long-read sequencing data allowed reliable detection of true-positive variants. Considering the high cost of HiFi sequencing, we herein propose appropriate methodologies for performing cost-effective and high-quality variant calling: 10× assembly-based variant calling. The results of the present study may facilitate the development of methods for identifying all genetic variants at the population level.
Affiliation(s)
- Hyunji Lee
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, 08826 Korea
- Department of Biological Sciences, Seoul National University, Seoul, 08826 Korea
| | - Jun Kim
- Department of Biological Sciences, Seoul National University, Seoul, 08826 Korea
- Research Institute of Basic Sciences, Seoul National University, Seoul, 08826 Korea
- Department of Convergent Bioscience and Informatics, College of Bioscience and Biotechnology, Chungnam National University, Daejeon, 34134 Korea
| | - Junho Lee
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, 08826 Korea
- Department of Biological Sciences, Seoul National University, Seoul, 08826 Korea
- Research Institute of Basic Sciences, Seoul National University, Seoul, 08826 Korea
|
197
|
Jiang YF, Wang S, Wang CL, Xu RH, Wang WW, Jiang Y, Wang MS, Jiang L, Dai LH, Wang JR, Chu XH, Zeng YQ, Fang LZ, Wu DD, Zhang Q, Ding XD. Pangenome obtained by long-read sequencing of 11 genomes reveal hidden functional structural variants in pigs. iScience 2023; 26:106119. [PMID: 36852268 PMCID: PMC9958381 DOI: 10.1016/j.isci.2023.106119] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/26/2022] [Revised: 12/21/2022] [Accepted: 01/30/2023] [Indexed: 02/05/2023] Open
Abstract
Long-read sequencing (LRS) facilitates both genome assembly and the discovery of structural variants (SVs). Here, we built a graph-based pig pangenome by incorporating 11 LRS genomes with an average BUSCO completeness score of 94.01%, revealing 206 Mb of novel sequence. We discovered 183,352 nonredundant SVs (63% novel), representing 12.12% of the reference genome. By genotyping SVs in an additional 196 short-read sequencing samples, we identified thousands of population-stratified SVs. In particular, we detected 7,568 Tibetan-specific SVs, some of which show significant population differentiation between Tibetan and low-altitude pigs and might be associated with high-altitude hypoxia adaptation in Tibetan pigs. By further integrating functional genomic data, we identified the most promising candidate genes within the SVs that might contribute to high-altitude hypoxia adaptation. Overall, our study generates a benchmark pangenome resource for illustrating the important roles of SVs in adaptive evolution, domestication, and genetic improvement of agronomic traits in pigs.
Affiliation(s)
- Yi-Fan Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
| | - Sheng Wang
- State Key Laboratory of Genetic Resources and Evolution, Yunnan Laboratory of Molecular Biology of Domestic Animals, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Chong-Long Wang
- Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
| | - Ru-Hai Xu
- Key Laboratory of Animal Genetics and Breeding of Zhejiang Province, Institute of Animal Husbandry and Veterinary Science, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
| | - Wen-Wen Wang
- Shandong Provincial Key Laboratory of Animal Biotechnology and Disease Control and Prevention, College of Animal Science and Technology, Shandong Agricultural University, Taian 271001, China
| | - Yao Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
| | - Ming-Shan Wang
- State Key Laboratory of Genetic Resources and Evolution, Yunnan Laboratory of Molecular Biology of Domestic Animals, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Li Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
| | - Li-He Dai
- Key Laboratory of Animal Genetics and Breeding of Zhejiang Province, Institute of Animal Husbandry and Veterinary Science, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
| | - Jie-Ru Wang
- Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
| | - Xiao-Hong Chu
- Key Laboratory of Animal Genetics and Breeding of Zhejiang Province, Institute of Animal Husbandry and Veterinary Science, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
| | - Yong-Qing Zeng
- Shandong Provincial Key Laboratory of Animal Biotechnology and Disease Control and Prevention, College of Animal Science and Technology, Shandong Agricultural University, Taian 271001, China
| | - Ling-Zhao Fang
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, 8000, Denmark
| | - Dong-Dong Wu
- State Key Laboratory of Genetic Resources and Evolution, Yunnan Laboratory of Molecular Biology of Domestic Animals, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Qin Zhang
- Shandong Provincial Key Laboratory of Animal Biotechnology and Disease Control and Prevention, College of Animal Science and Technology, Shandong Agricultural University, Taian 271001, China
| | - Xiang-Dong Ding
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
| |
|
198
|
Domcke S, Shendure J. A reference cell tree will serve science better than a reference cell atlas. Cell 2023; 186:1103-1114. [PMID: 36931241 DOI: 10.1016/j.cell.2023.02.016] [Citation(s) in RCA: 47] [Impact Index Per Article: 23.5] [Received: 06/22/2022] [Revised: 01/15/2023] [Accepted: 02/10/2023] [Indexed: 03/18/2023]
Abstract
Single-cell biology is facing a crisis of sorts. Vast numbers of single-cell molecular profiles are being generated, clustered and annotated. However, this is overwhelmingly ad hoc, and we continue to lack a principled, unified, and well-moored system for defining, naming, and organizing cell types. In this perspective, we argue against an atlas or periodic table-like discretization as the right metaphor for a reference taxonomy of cell types. In its place, we advocate for a data-driven, tree-based nomenclature that is rooted in a "consensus ontogeny" spanning the life cycle of a given species. We explore how such a reference cell tree, inclusive of both lineage histories and molecular states, could be constructed, represented, and segmented in practice. Analogous to the taxonomic classification of species, a consensus ontogeny would provide a universal, stable, and extendable framework for precise scientific communication, both contemporaneously and across the ages.
Affiliation(s)
- Silvia Domcke
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
| | - Jay Shendure
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA; Brotman Baty Institute for Precision Medicine, Seattle, WA, USA; Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA; Howard Hughes Medical Institute, Seattle, WA, USA.
| |
|
199
|
Jam HZ, Li Y, DeVito R, Mousavi N, Ma N, Lujumba I, Adam Y, Maksimov M, Huang B, Dolzhenko E, Qiu Y, Kakembo FE, Joseph H, Onyido B, Adeyemi J, Bakhtiari M, Park J, Javadzadeh S, Jjingo D, Adebiyi E, Bafna V, Gymrek M. A deep population reference panel of tandem repeat variation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.09.531600. [PMID: 36945429 PMCID: PMC10028971 DOI: 10.1101/2023.03.09.531600] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 03/14/2023]
Abstract
Tandem repeats (TRs) represent one of the largest sources of genetic variation in humans and are implicated in a range of phenotypes. Here we present a deep characterization of TR variation based on high-coverage whole-genome sequencing from 3,550 diverse individuals from the 1000 Genomes Project and H3Africa cohorts. We develop a method, EnsembleTR, to integrate genotypes from four separate methods, resulting in high-quality genotypes at more than 1.7 million TR loci. Our catalog reveals novel sequence features influencing TR heterozygosity, identifies population-specific trinucleotide expansions, and finds hundreds of novel eQTL signals. Finally, we generate a phased haplotype panel that can be used to impute most TRs from nearby single nucleotide polymorphisms (SNPs) with high accuracy. Overall, the TR genotypes and reference haplotype panel generated here will serve as valuable resources for future genome-wide and population-wide studies of TRs and their role in human phenotypes.
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Yang Li
- Department of Medicine, University of California San Diego, La Jolla, CA
| | - Ross DeVito
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Nima Mousavi
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA
| | - Nichole Ma
- Department of Medicine, University of California San Diego, La Jolla, CA
| | - Ibra Lujumba
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Yagoub Adam
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mikhail Maksimov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Bonnie Huang
- Department of Bioengineering, University of California San Diego, La Jolla, CA
| | | | - Yunjiang Qiu
- Illumina Incorporated, San Diego, California 92122, USA
| | - Fredrick Elishama Kakembo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Habi Joseph
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Blessing Onyido
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Jumoke Adeyemi
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Daudi Jjingo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
- Department of Computer Science, Makerere University, Kampala, Uganda
| | - Ezekiel Adebiyi
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
- Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, Baden-Württemberg, 69120, Germany
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
- Department of Medicine, University of California San Diego, La Jolla, CA
| |
|
200
|
Aguilar R, Camplisson CK, Lin Q, Miga KH, Noble WS, Beliveau BJ. Tigerfish designs oligonucleotide-based in situ hybridization probes targeting intervals of highly repetitive DNA at the scale of genomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.06.530899. [PMID: 36945528 PMCID: PMC10028787 DOI: 10.1101/2023.03.06.530899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/30/2023]
Abstract
Fluorescent in situ hybridization (FISH) is a powerful method for the targeted visualization of nucleic acids in their native contexts. Recent technological advances have leveraged computationally designed oligonucleotide (oligo) probes to interrogate >100 distinct targets in the same sample, pushing the boundaries of FISH-based assays. However, even in the most highly multiplexed experiments, repetitive DNA regions are typically not included as targets, as the computational design of specific probes against such regions presents significant technical challenges. Consequently, many open questions remain about the organization and function of highly repetitive sequences. Here, we introduce Tigerfish, a software tool for the genome-scale design of oligo probes against repetitive DNA intervals. We showcase Tigerfish by designing a panel of 24 interval-specific repeat probes specific to each of the 24 human chromosomes and imaging this panel on metaphase spreads and in interphase nuclei. Tigerfish extends the powerful toolkit of oligo-based FISH to highly repetitive DNA.
Affiliation(s)
- Robin Aguilar
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | | | - Qiaoyi Lin
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Karen H. Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, CA, USA
| | - William S. Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Brian J. Beliveau
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| |
|