1
Hu T, Mosbruger TL, Tairis NG, Dinou A, Jayaraman P, Sarmady M, Brewster K, Li Y, Hayeck TJ, Duke JL, Monos DS. Targeted and complete genomic sequencing of the major histocompatibility complex in haplotypic form of individual heterozygous samples. Genome Res 2024; 34:1500-1513. [PMID: 39327030 PMCID: PMC11534196 DOI: 10.1101/gr.278588.123] [Received: 10/26/2023] [Accepted: 09/19/2024] [Indexed: 09/28/2024]
Abstract
The human major histocompatibility complex (MHC) is a ∼4 Mb genomic segment on Chromosome 6 that plays a pivotal role in the immune response. Despite its importance in various traits and diseases, its complex nature makes it challenging to accurately characterize on a routine basis. We present a novel approach allowing targeted sequencing and de novo haplotypic assembly of the MHC region in heterozygous samples, using long-read sequencing technologies. Our approach is validated using two reference samples, two family trios, and an African-American sample. We achieved excellent coverage (96.6%-99.9% with at least 30× depth) and high accuracy (99.89%-99.99%) for the different haplotypes. This methodology offers a reliable and cost-effective method for sequencing and fully characterizing the MHC without the need for whole-genome sequencing, facilitating broader studies on this important genomic segment and having significant implications in immunology, genetics, and medicine.
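The coverage figure quoted in the abstract (fraction of the target with at least 30× depth) can be illustrated with a minimal sketch. The function name `breadth_at_depth` and the toy depth values are assumptions made here for illustration; the abstract does not describe how the authors computed the statistic.

```python
def breadth_at_depth(depths, min_depth=30):
    """Fraction of positions whose read depth is at least min_depth.

    `depths` is a hypothetical per-base depth list for the target region;
    real pipelines would derive it from an alignment file.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy example: 7 of 10 positions reach 30x.
toy_depths = [35, 40, 12, 30, 29, 31, 50, 0, 33, 45]
print(breadth_at_depth(toy_depths))  # 0.7
```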
Affiliation(s)
- Taishan Hu
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Timothy L Mosbruger
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Nikolaos G Tairis
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Amalia Dinou
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Pushkala Jayaraman
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Mahdi Sarmady
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Kingham Brewster
- Sequencing and Genotyping Center, Delaware Biotechnology Institute, University of Delaware, Newark, Delaware 19713, USA
| | - Yang Li
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Tristan J Hayeck
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| | - Jamie L Duke
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Dimitri S Monos
- Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA;
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
2
Uzuner H, Paschen A, Schadendorf D, Köster J. Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing. BMC Bioinformatics 2024; 25:240. [PMID: 39014339 PMCID: PMC11253481 DOI: 10.1186/s12859-024-05832-4] [Received: 01/30/2024] [Accepted: 06/10/2024] [Indexed: 07/18/2024] Open
Abstract
BACKGROUND: Identification of human leukocyte antigen (HLA) types from DNA-sequenced human samples is important in organ transplantation and cancer immunotherapy and remains a challenging task considering sequence homology and extreme polymorphism of HLA genes. RESULTS: We present Orthanq, a novel statistical model and corresponding application for transparent and uncertainty-aware quantification of haplotypes. We use our approach to perform HLA typing while, for the first time, reporting uncertainty of predictions and transparently observing mutations beyond reported HLA types. Using 99 gold standard samples from the 1000 Genomes, Illumina Platinum Genomes, and Genome in a Bottle projects, we show that Orthanq can provide overall superior accuracy and shorter runtimes than state-of-the-art HLA typers. CONCLUSIONS: Orthanq is the first approach that can directly use existing pangenome alignments and type all HLA loci. Moreover, it can be generalized to uses beyond HLA typing, e.g. virus lineage quantification. Orthanq is available at https://orthanq.github.io.
Affiliation(s)
- Hamdiye Uzuner
- Bioinformatics and Computational Oncology, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany
| | - Annette Paschen
- Department of Dermatology, West German Cancer Center, University Hospital Essen, University Duisburg-Essen, Essen, Germany
- German Consortium for Translational Cancer Research (DKTK), Partner Site Essen/Düsseldorf, Essen, Germany
| | - Dirk Schadendorf
- Department of Dermatology, West German Cancer Center, University Hospital Essen, University Duisburg-Essen, Essen, Germany
- German Consortium for Translational Cancer Research (DKTK), Partner Site Essen/Düsseldorf, Essen, Germany
| | - Johannes Köster
- Bioinformatics and Computational Oncology, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany
- German Consortium for Translational Cancer Research (DKTK), Partner Site Essen/Düsseldorf, Essen, Germany
3
Bai X, Chen Z, Chen K, Wu Z, Wang R, Liu J, Chang L, Wen L, Tang F. Simultaneous de novo calling and phasing of genetic variants at chromosome-scale using NanoStrand-seq. Cell Discov 2024; 10:74. [PMID: 38977679 PMCID: PMC11231365 DOI: 10.1038/s41421-024-00694-9] [Received: 06/09/2023] [Accepted: 05/23/2024] [Indexed: 07/10/2024] Open
Abstract
The successful accomplishment of the first telomere-to-telomere human genome assembly, T2T-CHM13, marked a milestone in achieving completeness of the human reference genome. The upcoming era of genome study will focus on fully phased diploid genome assembly, with an emphasis on genetic differences between individual haplotypes. Most existing sequencing approaches only achieved localized haplotype phasing and relied on additional pedigree information for further whole-chromosome scale phasing. The short-read-based Strand-seq method is able to directly phase single nucleotide polymorphisms (SNPs) at whole-chromosome scale but falls short when it comes to phasing structural variations (SVs). To shed light on this issue, we developed a Nanopore sequencing platform-based Strand-seq approach, which we named NanoStrand-seq. This method allowed for de novo SNP calling with high precision (99.52%) and achieved a superior phasing accuracy (0.02% Hamming error rate) at whole-chromosome scale, a level of performance comparable to Strand-seq for haplotype phasing of the GM12878 genome. Importantly, we demonstrated that NanoStrand-seq can efficiently resolve the MHC locus, a highly polymorphic genomic region. Moreover, NanoStrand-seq enabled independent direct calling and phasing of deletions and insertions at whole-chromosome level; when applied to long genomic regions of SNP homozygosity, it outperformed the strategy that combined Strand-seq with bulk long-read sequencing. Finally, we showed that, like Strand-seq, NanoStrand-seq was also applicable to primary cultured cells. Together, here we provided a novel methodology that enabled interrogation of a full spectrum of haplotype-resolved SNPs and SVs at whole-chromosome scale, with broad applications for species with diploid or even potentially polyploid genomes.
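The Hamming error rate cited in the abstract can be illustrated with a short sketch. It uses one common definition, stated here as an assumption because the abstract does not spell it out: haplotypes are encoded as 0/1 alleles over the same heterozygous sites, and because haplotype labels are arbitrary, the prediction is compared against both the truth and its complement. The function name and toy data are hypothetical.

```python
def hamming_error_rate(truth, predicted):
    """Hamming error rate between a true and a predicted haplotype.

    Both inputs are equal-length sequences of 0/1 alleles over the same
    heterozygous sites. The smaller of the two mismatch counts (against
    the truth or against its complement) is divided by the site count.
    """
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    if not truth:
        raise ValueError("haplotypes must not be empty")
    n = len(truth)
    mismatches = sum(1 for t, p in zip(truth, predicted) if t != p)
    # Mismatches against the complementary haplotype are n - mismatches.
    return min(mismatches, n - mismatches) / n


truth = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
predicted = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]  # one wrongly phased site
print(hamming_error_rate(truth, predicted))  # 0.1
```

Taking the minimum over both orientations keeps a prediction that is correct but labeled as the opposite haplotype from being scored as almost entirely wrong.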
Affiliation(s)
- Xiuzhen Bai
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
- Changping Laboratory, Beijing, China
| | - Zonggui Chen
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Changping Laboratory, Beijing, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Kexuan Chen
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- School of Life Sciences, Peking University, Beijing, China
| | - Zixin Wu
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Rui Wang
- Department of Medicine, Cancer Institute, Stanford University, Stanford, CA, USA
| | - Jun'e Liu
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
- Changping Laboratory, Beijing, China
- School of Life Sciences, Peking University, Beijing, China
| | - Liang Chang
- State Key Laboratory of Female Fertility Promotion, Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, China
- National Clinical Research Center for Obstetrics and Gynecology (Peking University Third Hospital), Beijing, China
- Key Laboratory of Assisted Reproduction (Peking University), Ministry of Education Beijing, Beijing, China
- Key Laboratory of Reproductive Endocrinology and Assisted Reproductive Technology, Beijing, China
| | - Lu Wen
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China
- Changping Laboratory, Beijing, China
| | - Fuchou Tang
- Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China.
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China.
- Changping Laboratory, Beijing, China.
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China.
- School of Life Sciences, Peking University, Beijing, China.
4
Chen Y, Huang JH, Sun Y, Zhang Y, Li Y, Xu X. Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing. CELL REPORTS METHODS 2024; 4:100754. [PMID: 38614089 PMCID: PMC11133727 DOI: 10.1016/j.crmeth.2024.100754] [Received: 10/08/2023] [Revised: 01/03/2024] [Accepted: 03/20/2024] [Indexed: 04/15/2024]
Abstract
Precision medicine's emphasis on individual genetic variants highlights the importance of haplotype-resolved assembly, a computational challenge in bioinformatics given its combinatorial nature. While classical algorithms have made strides in addressing this issue, the potential of quantum computing remains largely untapped. Here, we present the vehicle routing problem (VRP) assembler: an approach that transforms this task into a vehicle routing problem, an optimization formulation solvable on a quantum computer. We demonstrate its potential and feasibility through a proof of concept on short synthetic diploid and triploid genomes using a D-Wave quantum annealer. To tackle larger-scale assembly problems, we integrate the VRP assembler with Google's OR-Tools, achieving a haplotype-resolved local assembly across the human major histocompatibility complex (MHC) region. Our results show encouraging performance compared to Hifiasm with phasing accuracy approaching the theoretical limit, underscoring the promising future of quantum computing in bioinformatics.
Affiliation(s)
- Yibo Chen
- BGI Research, Shenzhen 518083, China
| | | | - Yuhui Sun
- BGI Research, Shenzhen 518083, China
| | - Yong Zhang
- BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China.
| | - Yuxiang Li
- BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China.
| | - Xun Xu
- BGI Research, Shenzhen 518083, China; BGI Research, Wuhan 430047, China.
5
English AC, Dolzhenko E, Ziaei Jam H, McKenzie SK, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol 2024:10.1038/s41587-024-02225-z. [PMID: 38671154 DOI: 10.1038/s41587-024-02225-z] [Received: 10/30/2023] [Accepted: 03/28/2024] [Indexed: 04/28/2024]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Affiliation(s)
- Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| | | | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | | | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Bida Gu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
6
Lorig-Roach R, Meredith M, Monlong J, Jain M, Olsen HE, McNulty B, Porubsky D, Montague TG, Lucas JK, Condon C, Eizenga JM, Juul S, McKenzie SK, Simmonds SE, Park J, Asri M, Koren S, Eichler EE, Axel R, Martin B, Carnevali P, Miga KH, Paten B. Phased nanopore assembly with Shasta and modular graph phasing with GFAse. Genome Res 2024; 34:454-468. [PMID: 38627094 PMCID: PMC11067879 DOI: 10.1101/gr.278268.123] [Received: 07/19/2023] [Accepted: 03/19/2024] [Indexed: 04/30/2024]
Abstract
Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.
Affiliation(s)
- Ryan Lorig-Roach
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA;
| | - Melissa Meredith
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Jean Monlong
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, Massachusetts 02120, USA
| | - Hugh E Olsen
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Brandy McNulty
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Tessa G Montague
- The Mortimer B. Zuckerman Mind Brain Behavior Institute, Department of Neuroscience, Columbia University, New York, New York 10027, USA
- Howard Hughes Medical Institute, Columbia University, New York, New York 10032, USA
| | - Julian K Lucas
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Chris Condon
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Jordan M Eizenga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Sissel Juul
- Oxford Nanopore Technologies Incorporated, New York, New York 10013, USA
| | - Sean K McKenzie
- Oxford Nanopore Technologies Incorporated, New York, New York 10013, USA
| | - Sara E Simmonds
- Chan Zuckerberg Initiative Foundation, Redwood City, California 94063, USA
| | - Jimin Park
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
| | - Richard Axel
- The Mortimer B. Zuckerman Mind Brain Behavior Institute, Department of Neuroscience, Columbia University, New York, New York 10027, USA
- Howard Hughes Medical Institute, Columbia University, New York, New York 10032, USA
| | - Bruce Martin
- Chan Zuckerberg Initiative Foundation, Redwood City, California 94063, USA
| | - Paolo Carnevali
- Chan Zuckerberg Initiative Foundation, Redwood City, California 94063, USA;
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California 95060, USA;
7
Mikhaylova V, Rzepka M, Kawamura T, Xia Y, Chang PL, Zhou S, Paasch A, Pham L, Modi N, Yao L, Perez-Agustin A, Pagans S, Boles TC, Lei M, Wang Y, Garcia-Bassets I, Chen Z. Targeted phasing of 2-200 kilobase DNA fragments with a short-read sequencer and a single-tube linked-read library method. Sci Rep 2024; 14:7988. [PMID: 38580715 PMCID: PMC10997766 DOI: 10.1038/s41598-024-58733-0] [Received: 08/15/2023] [Accepted: 04/02/2024] [Indexed: 04/07/2024] Open
Abstract
In the human genome, heterozygous sites refer to genomic positions with a different allele or nucleotide variant on the maternal and paternal chromosomes. Resolving these allelic differences by chromosomal copy, also known as phasing, is achievable on a short-read sequencer when using a library preparation method that captures long-range genomic information. TELL-Seq is a library preparation that captures long-range genomic information with the aid of molecular identifiers (barcodes). The same barcode is used to tag the reads derived from the same long DNA fragment within a range of up to 200 kilobases (kb), generating linked-reads. This strategy can be used to phase an entire genome. Here, we introduce a TELL-Seq protocol developed for targeted applications, enabling the phasing of enriched loci of varying sizes, purity levels, and heterozygosity. To validate this protocol, we phased 2-200 kb loci enriched with different methods: CRISPR/Cas9-mediated excision coupled with pulsed-field electrophoresis for the longest fragments, CRISPR/Cas9-mediated protection from exonuclease digestion for mid-size fragments, and long PCR for the shortest fragments. All selected loci have known clinical relevance: BRCA1, BRCA2, MLH1, MSH2, MSH6, APC, PMS2, SCN5A-SCN10A, and PIK3CA. Collectively, the analyses show that TELL-Seq can accurately phase 2-200 kb targets using a short-read sequencer.
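The linked-read idea described above, where reads sharing a barcode are taken to come from the same long DNA fragment, can be sketched as follows. This is an illustration only and not the TELL-Seq analysis pipeline: the `(barcode, position, allele)` tuple layout and the function name are assumptions made for this example.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Collect allele observations per barcode.

    `reads` is a hypothetical iterable of (barcode, position, allele)
    tuples. Alleles observed under one barcode are assumed to lie on the
    same fragment, and therefore on the same haplotype.
    """
    groups = defaultdict(list)
    for barcode, position, allele in reads:
        groups[barcode].append((position, allele))
    # Sort each group by position so it reads as a partial haplotype.
    return {barcode: sorted(obs) for barcode, obs in groups.items()}


toy_reads = [
    ("AAC", 100, "A"),
    ("GGT", 100, "G"),
    ("AAC", 5000, "T"),
    ("GGT", 5000, "C"),
]
for barcode, observations in group_reads_by_barcode(toy_reads).items():
    print(barcode, observations)
```

In this toy input, barcode `AAC` links allele A at position 100 with allele T at position 5000, which short reads alone could not connect across that distance.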
Affiliation(s)
| | - Madison Rzepka
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | | | - Yu Xia
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Peter L Chang
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | | | - Amber Paasch
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Long Pham
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Naisarg Modi
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Likun Yao
- Department of Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
| | - Adrian Perez-Agustin
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | - Sara Pagans
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | | | - Ming Lei
- Universal Sequencing Technology Corp., Canton, MA, 02021, USA
| | - Yong Wang
- Universal Sequencing Technology Corp., Canton, MA, 02021, USA
| | | | - Zhoutao Chen
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA.
8
Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges. Brief Bioinform 2024; 25:bbad508. [PMID: 38271481 PMCID: PMC10810331 DOI: 10.1093/bib/bbad508] [Received: 08/09/2023] [Revised: 11/18/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Affiliation(s)
- Yury A Barbitoff
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
| | - Mikhail O Ushakov
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Tatyana E Lazareva
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Yulia A Nasykhova
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Andrey S Glotov
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Alexander V Predeus
- Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
9
Jia P, Dong L, Yang X, Wang B, Bush SJ, Wang T, Lin J, Wang S, Zhao X, Xu T, Che Y, Dang N, Ren L, Zhang Y, Wang X, Liang F, Wang Y, Ruan J, Xia H, Zheng Y, Shi L, Lv Y, Wang J, Ye K. Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet. Genome Biol 2023; 24:277. [PMID: 38049885 PMCID: PMC10694985 DOI: 10.1186/s13059-023-03116-3] [Received: 08/25/2023] [Accepted: 11/21/2023] [Indexed: 12/06/2023] Open
Abstract
BACKGROUND: Recent state-of-the-art sequencing technologies enable the investigation of challenging regions in the human genome and expand the scope of variant benchmarking datasets. Herein, we sequence a Chinese Quartet, comprising two monozygotic twin daughters and their biological parents, using four short and long sequencing platforms (Illumina, BGI, PacBio, and Oxford Nanopore Technology). RESULTS: The long reads from the monozygotic twin daughters are phased into paternal and maternal haplotypes using the parent-child genetic map and for each haplotype. We also use long reads to generate haplotype-resolved whole-genome assemblies with completeness and continuity exceeding that of GRCh38. Using this Quartet, we comprehensively catalogue the human variant landscape, generating a dataset of 3,962,453 SNVs, 886,648 indels (< 50 bp), 9726 large deletions (≥ 50 bp), 15,600 large insertions (≥ 50 bp), 40 inversions, 31 complex structural variants, and 68 de novo mutations which are shared between the monozygotic twin daughters. Variants underrepresented in previous benchmarks owing to their complexity, including those located at long repeat regions, complex structural variants, and de novo mutations, are systematically examined in this study. CONCLUSIONS: In summary, this study provides high-quality haplotype-resolved assemblies and a comprehensive set of benchmarking resources for two Chinese monozygotic twin samples which, relative to existing benchmarks, offers expanded genomic coverage and insight into complex variant categories.
Affiliation(s)
- Peng Jia
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Lianhua Dong
- National Institute of Metrology, Beijing, 100029, China
| | - Xiaofei Yang
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
| | - Bo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Stephen J Bush
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Tingjie Wang
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Jiadong Lin
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Songbo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Xixi Zhao
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Tun Xu
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Yizhuo Che
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Ningxin Dang
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
| | - Luyao Ren
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
| | - Yujing Zhang
- National Institute of Metrology, Beijing, 100029, China
| | - Xia Wang
- National Institute of Metrology, Beijing, 100029, China
| | - Fan Liang
- GrandOmics Biosciences, Beijing, 100089, China
| | - Yang Wang
- GrandOmics Biosciences, Beijing, 100089, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
| | - Han Xia
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
| | - Yi Lv
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China.
| | - Jing Wang
- National Institute of Metrology, Beijing, 100029, China.
| | - Kai Ye
- National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China.
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China.
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China.
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China.
- School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China.
- Faculty of Science, Leiden University, Leiden, 2311EZ, The Netherlands.
|
10
|
Ren L, Duan X, Dong L, Zhang R, Yang J, Gao Y, Peng R, Hou W, Liu Y, Li J, Yu Y, Zhang N, Shang J, Liang F, Wang D, Chen H, Sun L, Hao L, Scherer A, Nordlund J, Xiao W, Xu J, Tong W, Hu X, Jia P, Ye K, Li J, Jin L, Hong H, Wang J, Fan S, Fang X, Zheng Y, Shi L. Quartet DNA reference materials and datasets for comprehensively evaluating germline variant calling performance. Genome Biol 2023; 24:270. [PMID: 38012772 PMCID: PMC10680274 DOI: 10.1186/s13059-023-03109-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/08/2022] [Accepted: 11/13/2023] [Indexed: 11/29/2023] Open
Abstract
BACKGROUND: Genomic DNA reference materials are widely recognized as essential for ensuring data quality in omics research. However, relying solely on reference datasets to evaluate the accuracy of variant calling results is incomplete, as they are limited to benchmark regions. Therefore, it is important to develop DNA reference materials that enable the assessment of variant detection performance across the entire genome. RESULTS: We established a DNA reference material suite from four immortalized cell lines derived from a family of parents and monozygotic twins. Comprehensive reference datasets of 4.2 million small variants and 15,000 structural variants were integrated and certified for evaluating the reliability of germline variant calls inside the benchmark regions. Importantly, the genetic built-in-truth of the Quartet family design enables estimation of the precision of variant calls outside the benchmark regions. Using the Quartet reference materials along with study samples, batch effects are objectively monitored and alleviated by training a machine learning model with the Quartet reference datasets to remove potential artifact calls. Moreover, the matched RNA and protein reference materials and datasets from the Quartet project enable cross-omics validation of variant calls from multiomics data. CONCLUSIONS: The Quartet DNA reference materials and reference datasets provide a unique resource for objectively assessing the quality of germline variant calls throughout the whole-genome regions and improving the reliability of large-scale genomic profiling.
Affiliation(s)
- Luyao Ren
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Xiaoke Duan
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | | | - Rui Zhang
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Jingcheng Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
| | - Yuechen Gao
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Rongxue Peng
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Wanwan Hou
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Yaqing Liu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Jingjing Li
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Naixin Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Jun Shang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Fan Liang
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Depeng Wang
- Nextomics Biosciences Institute, Wuhan, Hubei, China
| | - Hui Chen
- OrigiMed Co., Ltd, Shanghai, China
| | - Lele Sun
- Sequanta Technologies Co., Ltd, Shanghai, China
| | | | - Andreas Scherer
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
| | - Jessica Nordlund
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
- Department of Medical Sciences, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Wenming Xiao
- Office of Oncologic Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
| | - Joshua Xu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Xin Hu
- Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Peng Jia
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Kai Ye
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Jinming Li
- National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
| | - Li Jin
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Jing Wang
- National Institute of Metrology, Beijing, China.
| | - Shaohua Fan
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
| | - Xiang Fang
- National Institute of Metrology, Beijing, China.
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Shanghai Cancer Center, Fudan University, Shanghai, China
- International Human Phenome Institutes, Shanghai, China
|
11
|
English A, Dolzhenko E, Jam HZ, Mckenzie S, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Benchmarking of small and large variants across tandem repeats. bioRxiv 2023:2023.10.29.564632. [PMID: 37961319 PMCID: PMC10634962 DOI: 10.1101/2023.10.29.564632] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 11/15/2023]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ∼24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.
|
12
|
Majidian S, Agustinho DP, Chin CS, Sedlazeck FJ, Mahmoud M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol 2023; 24:221. [PMID: 37798733 PMCID: PMC10552390 DOI: 10.1186/s13059-023-03061-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/28/2022] [Accepted: 09/18/2023] [Indexed: 10/07/2023] Open
Abstract
Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
Affiliation(s)
- Sina Majidian
- Department of Computational Biology, University of Lausanne, 1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | | | | | - Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA.
| | - Medhat Mahmoud
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
|
13
|
Wang S, Wang M, Chen L, Pan G, Wang Y, Li SC. SpecHLA enables full-resolution HLA typing from sequencing data. Cell Rep Methods 2023; 3:100589. [PMID: 37714157 PMCID: PMC10545945 DOI: 10.1016/j.crmeth.2023.100589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 06/20/2023] [Accepted: 08/21/2023] [Indexed: 09/17/2023]
Abstract
Reconstructing diploid sequences of human leukocyte antigen (HLA) genes, i.e., full-resolution HLA typing, from sequencing data is challenging. The high homogeneity across HLA genes and the high heterogeneity within HLA alleles complicate the identification of genomic source loci for sequencing reads. Here, we present SpecHLA, which utilizes fine-tuned reads binning and local assembly to achieve accurate full-resolution HLA typing. SpecHLA accepts sequencing data from paired-end, 10×-linked-reads, high-throughput chromosome conformation capture (Hi-C), Pacific Biosciences (PacBio), and Oxford Nanopore Technology (ONT). It can also incorporate pedigree data and genotype frequency to refine typing. In 32 Human Genome Structural Variation Consortium, Phase 2 (HGSVC2) samples, SpecHLA achieved 98.6% accuracy for G-group-resolution HLA typing, inferring entire HLA alleles with an average of three mismatches fewer, ten gaps fewer, and 590 bp less edit distance than HISAT-genotype per allele. Additionally, SpecHLA exhibited a 2-field typing accuracy of 98.6% in 875 real samples. Finally, SpecHLA detected HLA loss of heterozygosity with 99.7% specificity and 96.8% sensitivity in simulated samples of cancer cell lines.
Affiliation(s)
- Shuai Wang
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
| | - Mengyao Wang
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
| | - Lingxi Chen
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
| | - Guangze Pan
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
| | - Yanfei Wang
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
| | - Shuai Cheng Li
- City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong.
|
14
|
Chin CS, Behera S, Khalak A, Sedlazeck FJ, Sudmant PH, Wagner J, Zook JM. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 2023; 20:1213-1221. [PMID: 37365340 PMCID: PMC10406601 DOI: 10.1038/s41592-023-01914-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/10/2022] [Accepted: 05/17/2023] [Indexed: 06/28/2023]
Abstract
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
Affiliation(s)
- Chen-Shan Chin
- GeneDX, Stamford, CT, USA.
- Foundation of Biological Data Science, Belmont, CA, USA.
| | - Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Asif Khalak
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Peter H Sudmant
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
|
15
|
Houwaart T, Scholz S, Pollock NR, Palmer WH, Kichula KM, Strelow D, Le DB, Belick D, Hülse L, Lautwein T, Wachtmeister T, Wollenweber TE, Henrich B, Köhrer K, Parham P, Guethlein LA, Norman PJ, Dilthey AT. Complete sequences of six major histocompatibility complex haplotypes, including all the major MHC class II structures. HLA 2023; 102:28-43. [PMID: 36932816 PMCID: PMC10986641 DOI: 10.1111/tan.15020] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 05/30/2022] [Revised: 02/10/2023] [Accepted: 02/24/2023] [Indexed: 03/19/2023]
Abstract
Accurate and comprehensive immunogenetic reference panels are key to the successful implementation of population-scale immunogenomics. The 5 Mbp Major Histocompatibility Complex (MHC) is the most polymorphic region of the human genome and is associated with multiple immune-mediated diseases, transplant matching and therapy responses. Analysis of MHC genetic variation is severely complicated by complex patterns of sequence variation, linkage disequilibrium and a lack of fully resolved MHC reference haplotypes, increasing the risk of spurious findings on analyzing this medically important region. Integrating Illumina, ultra-long Nanopore, and PacBio HiFi sequencing as well as bespoke bioinformatics, we completed five of the alternative MHC reference haplotypes of the current (GRCh38/hg38) build of the human reference genome and added one other. The six assembled MHC haplotypes encompass the DR1 and DR4 haplotype structures in addition to the previously completed DR2 and DR3, as well as six distinct classes of the structurally variable C4 region. Analysis of the assembled haplotypes showed that MHC class II sequence structures, including repeat element positions, are generally conserved within the DR haplotype supergroups, and that sequence diversity peaks in three regions around HLA-A, HLA-B+C, and the HLA class II genes. Demonstrating the potential for improved short-read analysis, the number of proper read pairs recruited to the MHC was found to be increased by 0.06%-0.49% in a 1000 Genomes Project read remapping experiment with seven diverse samples. Furthermore, the assembled haplotypes can serve as references for the community and provide the basis of a structurally accurate genotyping graph of the complete MHC region.
Affiliation(s)
- Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Stephan Scholz
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Nicholas R. Pollock
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - William H. Palmer
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Katherine M. Kichula
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Daniel Strelow
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Duyen B. Le
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Dana Belick
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Lisanna Hülse
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tobias Lautwein
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Thorsten Wachtmeister
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tassilo E. Wollenweber
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Birgit Henrich
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Karl Köhrer
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Parham
- Department of Structural Biology, and Department of Microbiology and Immunology, Stanford University, Stanford, California, USA
| | - Lisbeth A. Guethlein
- Department of Structural Biology, and Department of Microbiology and Immunology, Stanford University, Stanford, California, USA
| | - Paul J. Norman
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Alexander T. Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
|
16
|
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 281] [Impact Index Per Article: 281.0] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals1. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
| | - Mobin Asri
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Daniel Doerr
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Marina Haukness
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Shuangjia Lu
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - Julian K Lucas
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jean Monlong
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haley J Abel
- Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
| | - Silvia Buonaiuto
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | - Xian H Chang
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Justin Chu
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Christian Fischer
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Robert S Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Shilpa Garg
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
| | - Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
| | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Charles Markello
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Hugh E Olsen
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Trevor Pesout
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jonas A Sibbesen
- Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Chad Tomlinson
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Carl A Baker
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | | | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Robert M Cook-Deegan
- Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
| | - Omar E Cornejo
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Mark Diekhans
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam L Felsenfeld
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Nanibaa' A Garrison
- Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
- Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
| | | | - Jan O Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Hugo Magalhães
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Alice B Popejoy
- Department of Public Health Sciences, University of California, Davis, CA, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I Schultz
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Michael W Smith
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J Sofia
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
- Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - David Haussler
- Genomics Institute, University of California, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Karen H Miga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ira M Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, CA, USA
| |
|
17
|
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 31.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Nathan Dwarshuis
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
| | | | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| |
|
18
|
Mikhaylova V, Rzepka M, Kawamura T, Xia Y, Chang PL, Zhou S, Pham L, Modi N, Yao L, Perez-Agustin A, Pagans S, Boles TC, Lei M, Wang Y, Garcia-Bassets I, Chen Z. Targeted Phasing of 2-200 Kilobase DNA Fragments with a Short-Read Sequencer and a Single-Tube Linked-Read Library Method. bioRxiv 2023:2023.03.05.531179. [PMID: 36945366 PMCID: PMC10028795 DOI: 10.1101/2023.03.05.531179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Abstract
In the human genome, heterozygous sites are genomic positions with different alleles inherited from each parent. On average, there is a heterozygous site every 1-2 kilobases (kb). Resolving whether two alleles in neighboring heterozygous positions are physically linked (that is, phased) is possible with a short-read sequencer if the sequencing library captures long-range information. TELL-Seq is a library preparation method based on millions of barcoded micro-sized beads that enables instrument-free phasing of a whole human genome in a single PCR tube. TELL-Seq incorporates a unique molecular identifier (barcode) into the short reads generated from the same high-molecular-weight (HMW) DNA fragment (known as 'linked-reads'). However, genome-scale TELL-Seq is not cost-effective for applications focusing on a single locus or a few loci. Here, we present an optimized TELL-Seq protocol that enables the cost-effective phasing of enriched loci (targets) of varying sizes, purity levels, and heterozygosity. Targeted TELL-Seq maximizes linked-read efficiency and library yield while minimizing input requirements, fragment collisions on microbeads, and sequencing burden. To validate the targeted protocol, we phased seven 180-200 kb loci enriched by CRISPR/Cas9-mediated excision coupled with pulsed-field electrophoresis, four 20 kb loci enriched by CRISPR/Cas9-mediated protection from exonuclease digestion, and six 2-13 kb loci amplified by PCR. The selected targets have clinical and research relevance (BRCA1, BRCA2, MLH1, MSH2, MSH6, APC, PMS2, SCN5A-SCN10A, and PIK3CA). These analyses reveal that targeted TELL-Seq provides a reliable way of phasing allelic variants within targets (2-200 kb in length) with the low cost and high accuracy of short-read sequencing.
Affiliation(s)
| | - Madison Rzepka
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | | | - Yu Xia
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Peter L. Chang
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | | | - Long Pham
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Naisarg Modi
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Likun Yao
- Department of Medicine, University of California, San Diego, La Jolla, CA 92093 USA
| | - Adrian Perez-Agustin
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | - Sara Pagans
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | | | - Ming Lei
- Universal Sequencing Technology Corp., Canton, MA 02021, USA
| | - Yong Wang
- Universal Sequencing Technology Corp., Canton, MA 02021, USA
| | | | - Zhoutao Chen
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| |
|
19
|
Lorig-Roach R, Meredith M, Monlong J, Jain M, Olsen H, McNulty B, Porubsky D, Montague T, Lucas J, Condon C, Eizenga J, Juul S, McKenzie S, Simmonds SE, Park J, Asri M, Koren S, Eichler E, Axel R, Martin B, Carnevali P, Miga K, Paten B. Phased nanopore assembly with Shasta and modular graph phasing with GFAse. bioRxiv 2023:2023.02.21.529152. [PMID: 36865218 PMCID: PMC9980101 DOI: 10.1101/2023.02.21.529152] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 04/30/2023]
Abstract
As a step towards simplifying and reducing the cost of haplotype-resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool, called GFAse, for extending phasing to the chromosome scale. We test using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation, and show that newer, higher-accuracy ONT reads substantially improve assembly quality.
Affiliation(s)
- Ryan Lorig-Roach
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Melissa Meredith
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jean Monlong
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
| | - Hugh Olsen
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Brandy McNulty
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tessa Montague
- The Mortimer B. Zuckerman Mind Brain Behavior Institute, Department of Neuroscience, Columbia University, New York, NY, USA & Howard Hughes Medical Institute, Columbia University, New York, NY, USA
| | - Julian Lucas
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Chris Condon
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jordan Eizenga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | | | | | - Jimin Park
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Evan Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA & Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Richard Axel
- The Mortimer B. Zuckerman Mind Brain Behavior Institute, Department of Neuroscience, Columbia University, New York, NY, USA & Howard Hughes Medical Institute, Columbia University, New York, NY, USA
| | - Bruce Martin
- Chan Zuckerberg Initiative Foundation, Redwood City, CA, USA
| | - Paolo Carnevali
- Chan Zuckerberg Initiative Foundation, Redwood City, CA, USA
| | - Karen Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| |
|
20
|
Alper CA, Dawkins RL, Kulski JK, Larsen CE, Lloyd SS. Editorial: Population genomic architecture: Conserved polymorphic sequences (CPSs), not linkage disequilibrium. Front Genet 2023; 14:1140350. [PMID: 36777737 PMCID: PMC9911302 DOI: 10.3389/fgene.2023.1140350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2023] [Accepted: 01/17/2023] [Indexed: 01/28/2023] Open
Affiliation(s)
- Chester A. Alper
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA, United States
- Department of Pediatrics, Harvard Medical School, Boston, MA, United States
| | - Roger L. Dawkins
- CY O’Connor ERADE Village Foundation, North Dandalup, WA, Australia
| | - Jerzy K. Kulski
- Department of Molecular Life Sciences, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara, Japan
| | - Charles E. Larsen
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA, United States
- Department of Pediatrics, Harvard Medical School, Boston, MA, United States
| | - Sally S. Lloyd
- CY O’Connor ERADE Village Foundation, North Dandalup, WA, Australia
*Correspondence: Chester A. Alper; Roger L. Dawkins; Jerzy K. Kulski; Charles E. Larsen; Sally S. Lloyd
| |
|
21
|
Kulski JK, Suzuki S, Shiina T. Human leukocyte antigen super-locus: nexus of genomic supergenes, SNPs, indels, transcripts, and haplotypes. Hum Genome Var 2022; 9:49. [PMID: 36543786 PMCID: PMC9772353 DOI: 10.1038/s41439-022-00226-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 07/20/2022] [Revised: 11/08/2022] [Accepted: 11/15/2022] [Indexed: 12/24/2022] Open
Abstract
The human Major Histocompatibility Complex (MHC) or Human Leukocyte Antigen (HLA) super-locus is a highly polymorphic genomic region that encodes more than 140 coding genes including the transplantation and immune regulatory molecules. It receives special attention for genetic investigation because of its important role in the regulation of innate and adaptive immune responses and its strong association with numerous infectious and/or autoimmune diseases. In recent years, MHC genotyping and haplotyping using Sanger sequencing and next-generation sequencing (NGS) methods have produced many hundreds of genomic sequences of the HLA super-locus for comparative studies of the genetic architecture and diversity between the same and different haplotypes. In this special issue on 'The Current Landscape of HLA Genomics and Genetics', we provide a short review of some of the recent analytical developments used to investigate the SNP polymorphisms, structural variants (indels), transcription and haplotypes of the HLA super-locus. This review highlights the importance of using reference cell-lines, population studies, and NGS methods to improve and update our understanding of the mechanisms, architectural structures and combinations of human MHC genomic alleles (SNPs and indels) that better define and characterise haplotypes and their association with various phenotypes and diseases.
Affiliation(s)
- Jerzy K Kulski
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa, Japan
| | - Shingo Suzuki
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa, Japan
| | - Takashi Shiina
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa, Japan
| |
|
22
|
Chander V, Mahmoud M, Hu J, Dardas Z, Grochowski CM, Dawood M, Khayat MM, Li H, Li S, Jhangiani S, Korchina V, Shen H, Weissenberger G, Meng Q, Gingras MC, Muzny DM, Doddapaneni H, Posey JE, Lupski JR, Sabo A, Murdock DR, Sedlazeck FJ, Gibbs RA. Long read sequencing and expression studies of AHDC1 deletions in Xia-Gibbs syndrome reveal a novel genetic regulatory mechanism. Hum Mutat 2022; 43:2033-2053. [PMID: 36054313 PMCID: PMC10167679 DOI: 10.1002/humu.24461] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/02/2022] [Revised: 08/17/2022] [Accepted: 08/30/2022] [Indexed: 01/25/2023]
Abstract
Xia-Gibbs syndrome (XGS; MIM# 615829) is a rare Mendelian disorder characterized by developmental delay (DD), intellectual disability (ID), and hypotonia. Individuals with XGS typically harbor de novo protein-truncating mutations in the AT-Hook DNA binding motif containing 1 (AHDC1) gene, although some missense mutations can also cause XGS. Large de novo heterozygous deletions that encompass the AHDC1 gene have also been ascribed as diagnostic for the disorder, without substantial evidence to support their pathogenicity. We analyzed 19 individuals with large contiguous deletions involving AHDC1, along with other genes. One individual bore the smallest known contiguous AHDC1 deletion (∼350 kb), encompassing eight other genes within chr1p36.11 (FGR, IFI6, FAM76A, STX12, PPP1R8, THEMIS2, RPA2, SMPDL3B) and terminating within the first intron of AHDC1. The breakpoint junctions and phase of the deletion were identified using both short- and long-read sequencing (Oxford Nanopore). Quantification of RNA expression patterns in whole blood revealed that AHDC1 exhibited a mono-allelic expression pattern with no deficiency in overall AHDC1 expression levels, in contrast to the other deleted genes, which exhibited a 50% reduction in mRNA expression. These results suggest that AHDC1 expression in this individual is compensated by a novel regulatory mechanism, and they advance understanding of mutational and regulatory mechanisms in neurodevelopmental disorders.
Affiliation(s)
- Varuna Chander
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Jianhong Hu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Zain Dardas
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | | | - Moez Dawood
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Michael M. Khayat
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - He Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Shoudong Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Shalini Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Viktoriya Korchina
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Hua Shen
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | | | - Qingchang Meng
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Marie-Claude Gingras
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Donna M. Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Harsha Doddapaneni
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Jennifer E. Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - James R. Lupski
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Texas Children’s Hospital, Houston, Texas, USA
- Department of Pediatrics, Baylor College of Medicine, Houston, Texas, USA
| | - Aniko Sabo
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - David R. Murdock
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Fritz J. Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Department of Computer Science, Rice University, Houston, Texas, USA
| | - Richard A. Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| |
|
23
|
Xiao C, Chen Z, Chen W, Padilla C, Colgan M, Wu W, Fang LT, Liu T, Yang Y, Schneider V, Wang C, Xiao W. Personalized genome assembly for accurate cancer somatic mutation discovery using tumor-normal paired reference samples. Genome Biol 2022; 23:237. [PMID: 36352452 PMCID: PMC9648002 DOI: 10.1186/s13059-022-02803-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/05/2021] [Accepted: 10/25/2022] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND: The use of a personalized haplotype-specific genome assembly, rather than an unrelated, mosaic genome like GRCh38, as a reference for detecting the full spectrum of somatic events from cancers has long been advocated but has never been explored in tumor-normal paired samples. Here, we provide the first demonstrated use of a de novo assembled personalized genome as a reference for cancer mutation detection and quantify the effects of the reference genomes on the accuracy of somatic mutation detection. RESULTS: We generate de novo assemblies of the first tumor-normal paired genomes, both nuclear and mitochondrial, derived from the same individual with triple-negative breast cancer. The personalized genome was chromosomal scale, haplotype phased, and annotated. We demonstrate that it provides individual-specific haplotypes for complex regions and medically relevant genes. We illustrate that the personalized genome reference not only improves read alignments for both short-read and long-read sequencing data but also improves the detection accuracy of somatic SNVs and SVs. We identify the equivalent somatic mutation calls between two genome references and uncover novel somatic mutations only when the personalized genome assembly is used as a reference. CONCLUSIONS: Our findings demonstrate that use of a personalized genome with individual-specific haplotypes is essential for accurate detection of the full spectrum of somatic mutations in the paired tumor-normal samples. The unique resource and methodology established in this study will be beneficial to the development of precision oncology medicine not only for breast cancer, but also for other cancers.
Affiliation(s)
- Chunlin Xiao
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20894 USA
| | - Zhong Chen
- Center for Genomics, Loma Linda University School of Medicine, 11021 Campus St., Loma Linda, CA 92350 USA
| | - Wanqiu Chen
- Center for Genomics, Loma Linda University School of Medicine, 11021 Campus St., Loma Linda, CA 92350 USA
| | - Cory Padilla
- Dovetail Genomics, 100 Enterprise Way, Scotts Valley, CA 95066 USA
| | - Michael Colgan
- The Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD USA
| | - Wenjun Wu
- Blood Cell Development and Function Program, Fox Chase Cancer Center, Philadelphia, PA 19111 USA
| | - Li-Tai Fang
- Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., 1301 Shoreway Road, Belmont, CA 94002 USA
| | - Tiantian Liu
- Center for Genomics, Loma Linda University School of Medicine, 11021 Campus St., Loma Linda, CA 92350 USA
| | - Yibin Yang
- Blood Cell Development and Function Program, Fox Chase Cancer Center, Philadelphia, PA 19111 USA
| | - Valerie Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20894 USA
| | - Charles Wang
- Center for Genomics, Loma Linda University School of Medicine, 11021 Campus St., Loma Linda, CA 92350 USA
| | - Wenming Xiao
- The Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD USA
|
24
|
Jarvis ED, Formenti G, Rhie A, Guarracino A, Yang C, Wood J, Tracey A, Thibaud-Nissen F, Vollger MR, Porubsky D, Cheng H, Asri M, Logsdon GA, Carnevali P, Chaisson MJP, Chin CS, Cody S, Collins J, Ebert P, Escalona M, Fedrigo O, Fulton RS, Fulton LL, Garg S, Gerton JL, Ghurye J, Granat A, Green RE, Harvey W, Hasenfeld P, Hastie A, Haukness M, Jaeger EB, Jain M, Kirsche M, Kolmogorov M, Korbel JO, Koren S, Korlach J, Lee J, Li D, Lindsay T, Lucas J, Luo F, Marschall T, Mitchell MW, McDaniel J, Nie F, Olsen HE, Olson ND, Pesout T, Potapova T, Puiu D, Regier A, Ruan J, Salzberg SL, Sanders AD, Schatz MC, Schmitt A, Schneider VA, Selvaraj S, Shafin K, Shumate A, Stitziel NO, Stober C, Torrance J, Wagner J, Wang J, Wenger A, Xiao C, Zimin AV, Zhang G, Wang T, Li H, Garrison E, Haussler D, Hall I, Zook JM, Eichler EE, Phillippy AM, Paten B, Howe K, Miga KH. Semi-automated assembly of high-quality diploid human reference genomes. Nature 2022; 611:519-531. [PMID: 36261518 PMCID: PMC9668749 DOI: 10.1038/s41586-022-05325-5] [Citation(s) in RCA: 70] [Impact Index Per Article: 35.0] [Received: 12/29/2021] [Accepted: 09/06/2022] [Indexed: 01/01/2023]
Abstract
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Affiliation(s)
- Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Andrea Guarracino
- Genomics Research Centre, Human Technopole, Viale Rita Levi-Montalcini, Milan, Italy
| | | | - Jonathan Wood
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Mark J P Chaisson
- Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Chen-Shan Chin
- Foundation for Biological Data Science, Belmont, CA, USA
| | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Merly Escalona
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Robert S Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Lucinda L Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Shilpa Garg
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | - Jay Ghurye
- Dovetail Genomics, Scotts Valley, CA, USA
| | | | - Richard E Green
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - William Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Patrick Hasenfeld
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | | | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | | | - Miten Jain
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Melanie Kirsche
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Joyce Lee
- Bionano Genomics, San Diego, CA, USA
| | - Daofeng Li
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Tina Lindsay
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Julian Lucas
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Feng Luo
- School of Computing, Clemson University, Clemson, SC, USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | | | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Fan Nie
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Hugh E Olsen
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | | | - Jue Ruan
- Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | | | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Nathan O Stitziel
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Cardiovascular Division, John T. Milliken Department of Internal Medicine, Washington University School of Medicine, St. Louis, USA
| | - Catherine Stober
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | | | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | | | - Chuanle Xiao
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Guojie Zhang
- Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou, China
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - David Haussler
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Ira Hall
- Yale School of Medicine, New Haven, CT, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK.
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
|
25
|
Xie H, Li W, Hu Y, Yang C, Lu J, Guo Y, Wen L, Tang F. De novo assembly of human genome at single-cell levels. Nucleic Acids Res 2022; 50:7479-7492. [PMID: 35819189 PMCID: PMC9303314 DOI: 10.1093/nar/gkac586] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 02/10/2022] [Revised: 05/17/2022] [Accepted: 06/24/2022] [Indexed: 12/12/2022]
Abstract
Genome assembly has benefited from long-read sequencing technologies with higher accuracy and higher continuity. However, most human genome assemblies require large amounts of DNA from homogeneous cell lines without preserving cell heterogeneity, since cell heterogeneity could profoundly affect haplotype assembly results. Herein, using single-cell genome long-read sequencing technology (SMOOTH-seq), we have sequenced K562 and HG002 cells on PacBio HiFi and Oxford Nanopore Technologies (ONT) platforms and conducted de novo genome assembly. For the first time, we have completed the human genome assembly with high continuity (with NG50 of ∼2 Mb using 95 individual K562 cells) at single-cell levels, and explored the impact of different assemblers and sequencing strategies on genome assembly. With sequencing data from 30 diploid individual HG002 cells of relatively high genome coverage (average coverage ∼41.7%) on the ONT platform, the NG50 can reach over 1.3 Mb. Furthermore, with the assembled genome from the K562 single-cell dataset, a more complete and accurate set of insertion events and complex structural variations could be identified. This study opened a new chapter on the practice of single-cell genome de novo assembly.
Affiliation(s)
- Haoling Xie
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Peking University-Tsinghua University-National Institute of Biological Sciences Joint Graduate Program (PTN), School of Life Sciences, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Wen Li
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Yuqiong Hu
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Cheng Yang
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Jiansen Lu
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Yuqing Guo
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Lu Wen
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
| | - Fuchou Tang
- School of Life Sciences, Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
- Peking University-Tsinghua University-National Institute of Biological Sciences Joint Graduate Program (PTN), School of Life Sciences, Peking University, Beijing 100871, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
|
26
|
Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG, Johanson E, Boja E, Maier EJ, Serang O, Jáspez D, Lorenzo-Salazar JM, Muñoz-Barrera A, Rubio-Rodríguez LA, Flores C, Kyriakidis K, Malousi A, Shafin K, Pesout T, Jain M, Paten B, Chang PC, Kolesnikov A, Nattestad M, Baid G, Goel S, Yang H, Carroll A, Eveleigh R, Bourgey M, Bourque G, Li G, Ma C, Tang L, Du Y, Zhang S, Morata J, Tonda R, Parra G, Trotta JR, Brueffer C, Demirkaya-Budak S, Kabakci-Zorlu D, Turgut D, Kalay Ö, Budak G, Narcı K, Arslan E, Brown R, Johnson IJ, Dolgoborodov A, Semenyuk V, Jain A, Tetikol HS, Jain V, Ruehle M, Lajoie B, Roddey C, Catreux S, Mehio R, Ahsan MU, Liu Q, Wang K, Ebrahim Sahraeian SM, Fang LT, Mohiyuddin M, Hung C, Jain C, Feng H, Li Z, Chen L, Sedlazeck FJ, Zook JM. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genomics 2022; 2:S2666-979X(22)00058-1. [PMID: 35720974 PMCID: PMC9205427 DOI: 10.1016/j.xgen.2022.100129] [Citation(s) in RCA: 54] [Impact Index Per Article: 27.0] [Received: 12/07/2020] [Revised: 11/01/2021] [Accepted: 04/08/2022] [Indexed: 11/19/2022]
Abstract
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.
Affiliation(s)
- Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | | | | | | | - Elaine Johanson
- Office of Health Informatics, Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA
| | - Emily Boja
- Office of Health Informatics, Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA
| | - Ezekiel J. Maier
- Booz Allen Hamilton, 8283 Greensboro Drive, Mclean, VA 22102, USA
| | - Omar Serang
- DNAnexus, Inc., 1975 W El Camino Real #204, Mountain View, CA 94040, USA
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - José M. Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Luis A. Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
- Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain
- Instituto de Tecnologías Biomédicas (ITB), Universidad de La Laguna, 38200 San Cristóbal de La Laguna, Spain
| | - Konstantinos Kyriakidis
- School of Pharmacy, Aristotle University of Thessaloniki (AUTH), 541 24 Thessaloniki, Greece
- Genomics and Epigenomics Translational Research (GENeTres), Center for Interdisciplinary Research and Innovation, 570 01 Thessaloniki, Greece
| | - Andigoni Malousi
- Genomics and Epigenomics Translational Research (GENeTres), Center for Interdisciplinary Research and Innovation, 570 01 Thessaloniki, Greece
- Laboratory of Biological Chemistry, School of Medicine, Aristotle University of Thessaloniki (AUTH), 541 24 Thessaloniki, Greece
| | - Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Miten Jain
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Pi-Chuan Chang
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | | | - Maria Nattestad
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Gunjan Baid
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Sidharth Goel
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Howard Yang
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Andrew Carroll
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Robert Eveleigh
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Mathieu Bourgey
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Guillaume Bourque
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Gen Li
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - ChouXian Ma
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - LinQi Tang
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - YuanPing Du
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - ShaoWei Zhang
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - Jordi Morata
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Raúl Tonda
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Genís Parra
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jean-Rémi Trotta
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Christian Brueffer
- Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden
| | | | | | - Deniz Turgut
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Özem Kalay
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Gungor Budak
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Kübra Narcı
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Elif Arslan
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | | | | | | | | | - Amit Jain
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | | | | | | | | | | | | | | | - Mian Umair Ahsan
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | | | - Li Tai Fang
- Roche Sequencing Solutions, Santa Clara, CA 95050, USA
| | | | | | - Chirag Jain
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | | | - Fritz J. Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
|
27
|
Yang J, Chaisson MJP. TT-Mars: structural variants assessment based on haplotype-resolved assemblies. Genome Biol 2022; 23:110. [PMID: 35524317 PMCID: PMC9077962 DOI: 10.1186/s13059-022-02666-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/29/2021] [Accepted: 03/30/2022] [Indexed: 01/30/2023]
Abstract
Variant benchmarking is often performed by comparing a test callset to a gold standard set of variants. In repetitive regions of the genome, it may be difficult to establish what is the truth for a call, for example, when different alignment scoring metrics provide equally supported but different variant calls on the same data. Here, we provide an alternative approach, TT-Mars, that takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by providing false discovery rates for variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.
Affiliation(s)
- Jianzhi Yang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
|
28
|
Wagner J, Olson ND, Harris L, Khan Z, Farek J, Mahmoud M, Stankovic A, Kovacevic V, Yoo B, Miller N, Rosenfeld JA, Ni B, Zarate S, Kirsche M, Aganezov S, Schatz MC, Narzisi G, Byrska-Bishop M, Clarke W, Evani US, Markello C, Shafin K, Zhou X, Sidow A, Bansal V, Ebert P, Marschall T, Lansdorp P, Hanlon V, Mattsson CA, Barrio AM, Fiddes IT, Xiao C, Fungtammasan A, Chin CS, Wenger AM, Rowell WJ, Sedlazeck FJ, Carroll A, Salit M, Zook JM. Benchmarking challenging small variants with linked and long reads. Cell Genomics 2022; 2:100128. [PMID: 36452119 PMCID: PMC9706577 DOI: 10.1016/j.xgen.2022.100128] [Citation(s) in RCA: 55] [Impact Index Per Article: 27.5] [Indexed: 05/14/2023]
Abstract
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.
Affiliation(s)
- Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
- Corresponding author
| | - Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Lindsay Harris
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Ziad Khan
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Jesse Farek
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Ana Stankovic
- Seven Bridges, Omladinskih brigada 90g, 11070 Belgrade, Republic of Serbia
| | - Vladimir Kovacevic
- Seven Bridges, Omladinskih brigada 90g, 11070 Belgrade, Republic of Serbia
| | - Byunggil Yoo
- Children’s Mercy Kansas City, Kansas City, MO, USA
| | - Neil Miller
- Children’s Mercy Kansas City, Kansas City, MO, USA
| | | | - Bohan Ni
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Melanie Kirsche
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Aganezov
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Giuseppe Narzisi
- New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
| | | | - Wayne Clarke
- New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
| | - Uday S. Evani
- New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
| | - Charles Markello
- University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA
| | - Kishwar Shafin
- University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA
| | - Xin Zhou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Arend Sidow
- Department of Pathology, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
| | - Vikas Bansal
- Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
| | - Peter Ebert
- Institute of Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
| | - Tobias Marschall
- Institute of Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
| | - Peter Lansdorp
- Institute of Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
| | - Vincent Hanlon
- Terry Fox Laboratory, BC Cancer Research Institute and Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| | - Carl-Adam Mattsson
- Terry Fox Laboratory, BC Cancer Research Institute and Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| | | | | | - Chunlin Xiao
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | | | | | | | | | - Fritz J. Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Andrew Carroll
- Google Inc., 1600 Amphitheatre Pkwy., Mountain View, CA 94040, USA
| | - Marc Salit
- Joint Initiative for Metrology in Biology, SLAC National Laboratory, Stanford, CA, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
- Corresponding author
| |
|
29
|
Wagner J, Olson ND, Harris L, McDaniel J, Cheng H, Fungtammasan A, Hwang YC, Gupta R, Wenger AM, Rowell WJ, Khan ZM, Farek J, Zhu Y, Pisupati A, Mahmoud M, Xiao C, Yoo B, Sahraeian SME, Miller DE, Jáspez D, Lorenzo-Salazar JM, Muñoz-Barrera A, Rubio-Rodríguez LA, Flores C, Narzisi G, Evani US, Clarke WE, Lee J, Mason CE, Lincoln SE, Miga KH, Ebbert MTW, Shumate A, Li H, Chin CS, Zook JM, Sedlazeck FJ. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol 2022; 40:672-680. [PMID: 35132260 PMCID: PMC9117392 DOI: 10.1038/s41587-021-01158-1] [Citation(s) in RCA: 87] [Impact Index Per Article: 43.5] [Received: 06/07/2021] [Accepted: 11/10/2021] [Indexed: 11/09/2022]
Abstract
The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
Affiliation(s)
- Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Lindsay Harris
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Haoyu Cheng
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
| | | | | | | | | | | | - Ziad M Khan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Jesse Farek
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Aishwarya Pisupati
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Chunlin Xiao
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Byunggil Yoo
- Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
| | | | - Danny E Miller
- Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children's Hospital, Seattle, WA, USA
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - José M Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Luis A Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
- Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain
| | | | | | | | - Joyce Lee
- Bionano Genomics, San Diego, CA, USA
| | - Christopher E Mason
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
| | | | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Mark T W Ebbert
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
- Department of Internal Medicine, Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA
- Department of Neuroscience, University of Kentucky, Lexington, KY, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
| | | | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| |
|
30
|
Naito T, Okada Y. HLA imputation and its application to genetic and molecular fine-mapping of the MHC region in autoimmune diseases. Semin Immunopathol 2022; 44:15-28. [PMID: 34786601 PMCID: PMC8837514 DOI: 10.1007/s00281-021-00901-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/11/2021] [Accepted: 10/22/2021] [Indexed: 12/19/2022]
Abstract
Variations of human leukocyte antigen (HLA) genes in the major histocompatibility complex region (MHC) significantly affect the risk of various diseases, especially autoimmune diseases. Fine-mapping of causal variants in this region was challenging due to the difficulty in sequencing and its inapplicability to large cohorts. Thus, HLA imputation, a method to infer HLA types from regional single nucleotide polymorphisms, has been developed and has successfully contributed to MHC fine-mapping of various diseases. Different HLA imputation methods have been developed, each with its own advantages, and recent methods have been improved in terms of accuracy and computational performance. Additionally, advances in HLA reference panels by next-generation sequencing technologies have enabled higher resolution and a more reliable imputation, allowing a finer-grained evaluation of the association between sequence variations and disease risk. Risk-associated variants in the MHC region would affect disease susceptibility through complicated mechanisms including alterations in peripheral responses and central thymic selection of T cells. The cooperation of reliable HLA imputation methods, informative fine-mapping, and experimental validation of the functional significance of MHC variations would be essential for further understanding of the role of the MHC in the immunopathology of autoimmune diseases.
Affiliation(s)
- Tatsuhiko Naito
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2 Yamadaoka, Osaka, Suita, 565-0871, Japan.
- Department of Neurology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
| | - Yukinori Okada
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2 Yamadaoka, Osaka, Suita, 565-0871, Japan
- Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita, Japan
- Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Suita, Japan
| |
|
31
|
Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, Gupta N, Gabriel S, Blackwell TW, Ratan A, Taylor KD, Rich SS, Rotter JI, Haussler D, Garrison E, Paten B. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 2021; 374:abg8871. [PMID: 34914532 PMCID: PMC9365333 DOI: 10.1126/science.abg8871] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Indexed: 12/12/2022]
Abstract
We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe maps sequencing reads to thousands of human genomes at a speed comparable to that of standard methods mapping to a single reference genome. The increased mapping accuracy enables downstream improvements in genome-wide genotyping pipelines for both small variants and larger structural variants. We used Giraffe to genotype 167,000 structural variants, discovered in long-read studies, in 5202 diverse human genomes that were sequenced using short reads. We conclude that pangenomics facilitates a more comprehensive characterization of variation and, as a result, has the potential to improve many genomic analyses.
Affiliation(s)
- Jouni Sirén
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Xian Chang
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Adam M. Novak
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | | | | | - Glenn Hickey
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Pi-Chuan Chang
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
| | - Andrew Carroll
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
| | - Namrata Gupta
- Genomics Platform, Broad Institute, Cambridge, MA, USA
| | - Stacey Gabriel
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA, USA
| | | | - Aakrosh Ratan
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Kent D. Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Stephen S. Rich
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - David Haussler
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | | |
|
32
|
Shafin K, Pesout T, Chang PC, Nattestad M, Kolesnikov A, Goel S, Baid G, Kolmogorov M, Eizenga JM, Miga KH, Carnevali P, Jain M, Carroll A, Paten B. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 2021; 18:1322-1332. [PMID: 34725481 PMCID: PMC8571015 DOI: 10.1038/s41592-021-01299-w] [Citation(s) in RCA: 118] [Impact Index Per Article: 39.3] [Received: 03/08/2021] [Accepted: 09/06/2021] [Indexed: 01/15/2023]
Abstract
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
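The Q35+ and Q40+ figures quoted above are Phred-scaled quality values, where Q = -10 · log10(p) and p is the per-base error probability. The short sketch below converts between the two so the thresholds can be read as error rates; the function names are illustrative and are not part of PEPPER-Margin-DeepVariant.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q into a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability into a Phred-scaled quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (35, 40):
        p = phred_to_error_rate(q)
        # Report the error rate and the average spacing between errors.
        print(f"Q{q}: error rate {p:.2e}, about one error per {1 / p:,.0f} bases")
```

Under this scale, Q40 corresponds to roughly one error per 10,000 bases and Q35 to roughly one error per 3,000 bases.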
Affiliation(s)
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | | | | | | | | | | | | | - Karen H Miga
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | - Miten Jain
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | | |
|
33
|
Fu Y, Mahmoud M, Muraliraman VV, Sedlazeck FJ, Treangen TJ. Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment. Gigascience 2021; 10:6375129. [PMID: 34561697 PMCID: PMC8463296 DOI: 10.1093/gigascience/giab063] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/02/2021] [Revised: 07/22/2021] [Accepted: 08/29/2021] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND: Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that have primarily focused on either speed or accuracy. Various heuristics and scoring schemas have been implemented in widely used read mappers (minimap2 and NGMLR) to optimize for speed or accuracy, which have variable performance across different genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to the use of a single gap penalty across distinct mutational hot spots reduces read alignment accuracy and impedes structural variant detection.

FINDINGS: We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan leverages the computed normalized edit distance of the mapped reads via minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long-read mapper (NGMLR). In support of our hypothesis, we show that Vulcan improves the alignments for Oxford Nanopore Technology long reads for both simulated and real datasets. These improvements, in turn, lead to improved accuracy for structural variant calling performance on human genome datasets compared to either of the read-mapping methods alone.

CONCLUSIONS: Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.
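The read-selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not Vulcan's implementation: the `Alignment` record, the definition of normalized edit distance as edit distance divided by aligned length, and the fixed `cutoff` value are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one first-pass alignment (e.g. from minimap2)."""
    read_id: str
    edit_distance: int    # assumed to come from something like the SAM NM tag
    aligned_length: int   # number of aligned bases


def normalized_edit_distance(aln: Alignment) -> float:
    """Edit distance per aligned base; reads with no aligned bases score 1.0."""
    if aln.aligned_length <= 0:
        return 1.0
    return aln.edit_distance / aln.aligned_length


def split_for_realignment(
    alignments: Iterable[Alignment], cutoff: float
) -> Tuple[List[Alignment], List[Alignment]]:
    """Partition alignments into (keep, realign) using an assumed fixed cutoff.

    Reads above the cutoff are the ones a second, more expensive mapper
    would be asked to realign in a dual-mode scheme.
    """
    keep: List[Alignment] = []
    realign: List[Alignment] = []
    for aln in alignments:
        if normalized_edit_distance(aln) > cutoff:
            realign.append(aln)
        else:
            keep.append(aln)
    return keep, realign
```

In a real pipeline the reads in the second list would be passed to the slower mapper and their new alignments merged with the retained ones. How the cutoff is chosen is left open here.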
Affiliation(s)
- Yilei Fu
- Department of Computer Science, Rice University, Houston, TX 77251-1892, USA
| | - Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX 77251-1892, USA
| |
|
34
|
Yan SM, Sherman RM, Taylor DJ, Nair DR, Bortvin AN, Schatz MC, McCoy RC. Local adaptation and archaic introgression shape global diversity at human structural variant loci. eLife 2021; 10:e67615. [PMID: 34528508 PMCID: PMC8492059 DOI: 10.7554/elife.67615] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 02/18/2021] [Accepted: 09/14/2021] [Indexed: 12/13/2022] Open
Abstract
Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation - a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome.
Affiliation(s)
- Stephanie M Yan
- Department of Biology, Johns Hopkins University, Baltimore, United States
| | - Rachel M Sherman
- Department of Computer Science, Johns Hopkins University, Baltimore, United States
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, United States
| | - Divya R Nair
- Department of Biology, Johns Hopkins University, Baltimore, United States
| | - Andrew N Bortvin
- Department of Biology, Johns Hopkins University, Baltimore, United States
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, United States
- Department of Computer Science, Johns Hopkins University, Baltimore, United States
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, United States
| |
|
35
|
Mahmoud M, Doddapaneni H, Timp W, Sedlazeck FJ. PRINCESS: comprehensive detection of haplotype resolved SNVs, SVs, and methylation. Genome Biol 2021; 22:268. [PMID: 34521442 PMCID: PMC8442460 DOI: 10.1186/s13059-021-02486-w] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 05/01/2021] [Accepted: 09/02/2021] [Indexed: 12/11/2022] Open
Abstract
Long-read sequencing has been shown to have advantages in structural variation (SV) detection and methylation calling. Many studies focus on SVs, methylation, or phasing of SNVs alone; however, only the combination of variants provides a comprehensive insight into the sample and thus enables novel findings in biology or medicine. PRINCESS is a structured workflow that takes raw sequence reads and generates a fully phased SNV, SV, and methylation call set within a few hours. PRINCESS achieves high accuracy and long phasing even on low-coverage datasets and can resolve repetitive, complex, medically relevant genes that often escape detection. PRINCESS is publicly available at https://github.com/MeHelmy/princess under the MIT license.
Affiliation(s)
- Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
| | | | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| |
|
36
|
Using de novo assembly to identify structural variation of eight complex immune system gene regions. PLoS Comput Biol 2021; 17:e1009254. [PMID: 34343164 PMCID: PMC8363018 DOI: 10.1371/journal.pcbi.1009254] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/04/2021] [Revised: 08/13/2021] [Accepted: 07/06/2021] [Indexed: 12/11/2022] Open
Abstract
Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.

The human immune system is incredibly versatile, which underlies its capacity to defend the body against thousands of pathogens. At a molecular level, it recognizes pathogens using large libraries of antibodies and related protein receptors. These molecules are encoded by gene families that are particularly difficult to analyze due to their unusually complex patterns of similarities and differences between genes and individuals. To overcome this, we applied several sequencing methods to DNA from a single individual and developed methods to reconstruct the underlying sequence at eight of the immune-associated regions. Importantly, we used DNA extracted from monocytes to avoid capturing the further rearrangements that occur in active immune cells. We generated accurate assemblies by integrating multiple complementary data types, although we noted a small subset of locations that remain challenging. Moreover, we found that this individual contains multiple structural differences between the two inherited chromosomes and compared to previously analyzed genomes, affecting the copy number of immune system genes. Application of these methods in larger numbers of individuals will clearly uncover much more variation than is currently known, and might lead to new understanding of the effect of genetic variation on the broad range of human diseases determined by the immune response.
|
37
|
Mc Cartney AM, Mahmoud M, Jochum M, Agustinho DP, Zorman B, Al Khleifat A, Dabbaghie F, K Kesharwani R, Smolka M, Dawood M, Albin D, Aliyev E, Almabrazi H, Arslan A, Balaji A, Behera S, Billingsley K, L Cameron D, Daw J, T. Dawson E, De Coster W, Du H, Dunn C, Esteban R, Jolly A, Kalra D, Liao C, Liu Y, Lu TY, M Havrilla J, M Khayat M, Marin M, Monlong J, Price S, Rafael Gener A, Ren J, Sagayaradj S, Sapoval N, Sinner C, C. Soto D, Soylev A, Subramaniyan A, Syed N, Tadimeti N, Tater P, Vats P, Vaughn J, Walker K, Wang G, Zeng Q, Zhang S, Zhao T, Kille B, Biederstedt E, Chaisson M, English A, Kronenberg Z, J. Treangen T, Hefferon T, Chin CS, Busby B, J Sedlazeck F. An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates. F1000Res 2021; 10:246. [PMID: 34621504 PMCID: PMC8479851 DOI: 10.12688/f1000research.51477.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 08/23/2021] [Indexed: 11/20/2022] Open
Abstract
In October 2020, 62 scientists from nine nations worked together remotely in the Second Baylor College of Medicine & DNAnexus hackathon, focusing on related topics in structural variation, pan-genomes, and SARS-CoV-2 research. The overarching focus was to assess the current status of the field, identify the remaining challenges, and determine how to combine the strengths of the different interests to drive research and method development forward. Over the four days, eight groups each designed and developed new open-source methods to improve the identification and analysis of variations among species, including humans and SARS-CoV-2. These included improvements in SV calling, genotyping, annotation, and filtering, together with advancements in benchmarking existing methods. Furthermore, groups focused on the diversity of SARS-CoV-2. Daily discussion summaries and methods are publicly available at https://github.com/collaborativebioinformatics and provide valuable insights for both participants and the research community.
Affiliation(s)
| | | | | | | | | | | | - Fawaz Dabbaghie
- Institute for Medical Biometry and Bioinformatics, Düsseldorf, Germany
| | | | | | | | | | | | | | - Ahmed Arslan
- Stanford University School of Medicine, California, USA
| | | | | | | | - Daniel L Cameron
- Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
| | - Joyjit Daw
- NVIDIA Corporation, Santa Clara, California, USA
| | | | | | - Haowei Du
- Baylor College of Medicine, Houston, USA
| | | | | | | | | | | | | | | | | | | | | | - Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, USA
| | | | | | | | | | | | | | | | - Arda Soylev
- Konya Food and Agriculture University, Konya, Turkey
| | | | | | | | | | - Pankaj Vats
- NVIDIA Corporation, Santa Clara, California, USA
| | | | | | | | - Qiandong Zeng
- Laboratory Corporation of America Holdings, Westborough, USA
| | | | | | | | | | | | | | | | | | | | | | | | | |
|
38
|
Mc Cartney AM, Mahmoud M, Jochum M, Agustinho DP, Zorman B, Al Khleifat A, Dabbaghie F, K Kesharwani R, Smolka M, Dawood M, Albin D, Aliyev E, Almabrazi H, Arslan A, Balaji A, Behera S, Billingsley K, L Cameron D, Daw J, T. Dawson E, De Coster W, Du H, Dunn C, Esteban R, Jolly A, Kalra D, Liao C, Liu Y, Lu TY, M Havrilla J, M Khayat M, Marin M, Monlong J, Price S, Rafael Gener A, Ren J, Sagayaradj S, Sapoval N, Sinner C, C. Soto D, Soylev A, Subramaniyan A, Syed N, Tadimeti N, Tater P, Vats P, Vaughn J, Walker K, Wang G, Zeng Q, Zhang S, Zhao T, Kille B, Biederstedt E, Chaisson M, English A, Kronenberg Z, J. Treangen T, Hefferon T, Chin CS, Busby B, J Sedlazeck F. An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates. F1000Res 2021; 10:246. [PMID: 34621504 PMCID: PMC8479851 DOI: 10.12688/f1000research.51477.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 03/04/2021] [Indexed: 11/08/2023] Open
Abstract
In October 2020, 62 scientists from nine nations worked together remotely in the Second Baylor College of Medicine & DNAnexus hackathon, focusing on related topics in structural variation, pan-genomes, and SARS-CoV-2 research. The overarching aims were to assess the current status of the field, identify the remaining challenges, and determine how to combine the strengths of the different interests to drive research and method development forward. Over the four days, eight groups each designed and developed new open-source methods to improve the identification and analysis of variation among species, including humans and SARS-CoV-2. These included improvements in SV calling, genotyping, annotation, and filtering, together with advances in benchmarking existing methods. In addition, some groups focused on the diversity of SARS-CoV-2. Daily discussion summaries and methods are publicly available at https://github.com/collaborativebioinformatics and provide valuable insights for both participants and the research community.
Affiliation(s)
- Fawaz Dabbaghie
- Institute for Medical Biometry and Bioinformatics, Düsseldorf, Germany
- Ahmed Arslan
- Stanford University School of Medicine, California, USA
- Daniel L Cameron
- Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
- Joyjit Daw
- NVIDIA Corporation, Santa Clara, California, USA
- Haowei Du
- Baylor College of Medicine, Houston, USA
- Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, USA
- Arda Soylev
- Konya Food and Agriculture University, Konya, Turkey
- Pankaj Vats
- NVIDIA Corporation, Santa Clara, California, USA
- Qiandong Zeng
- Laboratory Corporation of America Holdings, Westborough, USA
|
39
|
Xu M, Guo L, Du X, Li L, Peters BA, Deng L, Wang O, Chen F, Wang J, Jiang Z, Han J, Ni M, Yang H, Xu X, Liu X, Huang J, Fan G. Accurate Haplotype-Resolved Assembly Reveals The Origin Of Structural Variants For Human Trios. Bioinformatics 2021; 37:2095-2102. [PMID: 33538292 PMCID: PMC8613828 DOI: 10.1093/bioinformatics/btab068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/27/2020] [Revised: 12/07/2020] [Accepted: 01/28/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Achieving a near-complete understanding of how the genome of an individual affects the phenotypes of that individual requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. RESULTS: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova), and are comparable to those of a trio-binning-based third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997% (Q65)). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. AVAILABILITY: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengyang Xu
- BGI-QingDao, Qingdao, 266555, China; BGI-Shenzhen, Shenzhen, 518083, China
- Lidong Guo
- BGI-QingDao, Qingdao, 266555, China; BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
- Xiao Du
- BGI-QingDao, Qingdao, 266555, China; BGI-Shenzhen, Shenzhen, 518083, China
- Lei Li
- BGI-QingDao, Qingdao, 266555, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, 101408, China
- Brock A Peters
- BGI-Shenzhen, Shenzhen, 518083, China; Complete Genomics Inc, 2904 Orchard Pkwy, San Jose, California, 95134, USA
- Li Deng
- BGI-QingDao, Qingdao, 266555, China
- Ou Wang
- BGI-Shenzhen, Shenzhen, 518083, China
- Fang Chen
- MGI, BGI-Shenzhen, Shenzhen, 518083, China
- Jun Wang
- BGI-QingDao, Qingdao, 266555, China
- Ming Ni
- BGI-QingDao, Qingdao, 266555, China; BGI-Shenzhen, Shenzhen, 518083, China
- Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
- Xin Liu
- BGI-QingDao, Qingdao, 266555, China; BGI-Shenzhen, Shenzhen, 518083, China; State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Jie Huang
- National Institutes for Food and Drug Control (NIFDC), No.2 Tiantan Xili, Dongcheng District, Beijing, 10050, China
- Guangyi Fan
- BGI-QingDao, Qingdao, 266555, China; BGI-Shenzhen, Shenzhen, 518083, China; State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
|
40
|
Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 2021; 18:170-175. [PMID: 33526886 DOI: 10.1038/s41592-020-01056-5] [Citation(s) in RCA: 1747] [Impact Index Per Article: 582.3] [Received: 06/26/2020] [Accepted: 12/23/2020] [Indexed: 02/07/2023]
Abstract
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.
|
41
|
Dilthey AT. State-of-the-art genome inference in the human MHC. Int J Biochem Cell Biol 2021; 131:105882. [PMID: 33189874 DOI: 10.1016/j.biocel.2020.105882] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 06/09/2020] [Revised: 10/29/2020] [Accepted: 11/04/2020] [Indexed: 12/20/2022]
Abstract
The Major Histocompatibility Complex (MHC) on the short arm of chromosome 6 is associated with more diseases than any other region of the genome; it encodes the antigen-presenting Human Leukocyte Antigen (HLA) proteins and is one of the key immunogenetic regions of the genome. Accurate genome inference and interpretation of MHC association signals have traditionally been hampered by the region's uniquely complex features, such as high levels of polymorphism; inter-gene sequence homologies; structural variation; and long-range haplotype structures. Recent algorithmic and technological advances have, however, significantly increased the accessibility of genetic variation in the MHC; these developments include (i) accurate SNP-based HLA type imputation; (ii) genome graph approaches for variation-aware genome inference from next-generation sequencing data; (iii) long-read-based diploid de novo assembly of the MHC; (iv) cost-effective targeted MHC sequencing methods. Applied to hundreds of thousands of samples over the last years, these technologies have already enabled significant biological discoveries, for example in the field of autoimmune disease genetics. Remaining challenges concern the development of integrated methods that leverage haplotype-resolved de novo assembly of the MHC for the development of improved MHC genotyping methods for short reads and the construction of improved reference panels for SNP-based imputation. Improved genome inference in the MHC can crucially contribute to an improved genetic and functional understanding of many immune-related phenotypes and diseases.
Affiliation(s)
- Alexander T Dilthey
- Institute of Medical Statistics and Computational Biology, University of Cologne, Cologne, Germany; Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Cologne, Germany; Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
|
42
|
Cechova M. Probably Correct: Rescuing Repeats with Short and Long Reads. Genes (Basel) 2020; 12:48. [PMID: 33396198 PMCID: PMC7823596 DOI: 10.3390/genes12010048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/09/2020] [Revised: 12/23/2020] [Accepted: 12/24/2020] [Indexed: 02/07/2023] Open
Abstract
Ever since the introduction of high-throughput sequencing following the human genome project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50-69%) is repetitive. As a result, a sizable proportion of sequencing reads is multi-mapping, i.e., without a unique placement in the genome. The two key parameters for whether or not a read is multi-mapping are the read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from "telomere to telomere". Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, disorienting the aligners and assemblers both in accuracy and speed. Here, I review the proposed and implemented solutions to the repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect the shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
Affiliation(s)
- Monika Cechova
- Genetics and Reproductive Biotechnologies, Veterinary Research Institute, Central European Institute of Technology (CEITEC), 621 00 Brno, Czech Republic
|
43
|
Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020; 30:1291-1305. [PMID: 32801147 PMCID: PMC7545148 DOI: 10.1101/gr.263566.120] [Citation(s) in RCA: 362] [Impact Index Per Article: 90.5] [Received: 03/14/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
Affiliation(s)
- Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Robert Grothe
- Pacific Biosciences, Menlo Park, California 94025, USA
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
|
44
|
Llamas B, Narzisi G, Schneider V, Audano PA, Biederstedt E, Blauvelt L, Bradbury P, Chang X, Chin CS, Fungtammasan A, Clarke WE, Cleary A, Ebler J, Eizenga J, Sibbesen JA, Markello CJ, Garrison E, Garg S, Hickey G, Lazo GR, Lin MF, Mahmoud M, Marschall T, Minkin I, Monlong J, Musunuri RL, Sagayaradj S, Novak AM, Rautiainen M, Regier A, Sedlazeck FJ, Siren J, Souilmi Y, Wagner J, Wrightsman T, Yokoyama TT, Zeng Q, Zook JM, Paten B, Busby B. A strategy for building and using a human reference pangenome. F1000Res 2019; 8:1751. [PMID: 34386196 PMCID: PMC8350888 DOI: 10.12688/f1000research.19630.1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Accepted: 07/23/2021] [Indexed: 01/27/2024] Open
Abstract
In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.
Affiliation(s)
- Bastien Llamas
- Australian Centre for Ancient DNA, School of Biological Sciences, Environment Institute, The University of Adelaide, Adelaide, South Australia, 5005, Australia
- Valerie Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Peter A. Audano
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
- Evan Biederstedt
- Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, 10065, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02215, USA
- Lon Blauvelt
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Peter Bradbury
- Robert W. Holley Center, USDA-ARS, Ithaca, NY, 14853, USA
- Xian Chang
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Alan Cleary
- National Center for Genome Resources, Santa Fe, NM, 87505, USA
- Jana Ebler
- Max Planck Institute for Informatics, Saarbrücken, Germany
- Jordan Eizenga
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Jonas A. Sibbesen
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Charles J. Markello
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Erik Garrison
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Shilpa Garg
- Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
- Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Gerard R. Lazo
- Western Regional Research Center, USDA-ARS, Albany, CA, 94710-1105, USA
- Medhat Mahmoud
- Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Ilia Minkin
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Jean Monlong
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Sagayamary Sagayaradj
- Genome Center, University of California, Davis, Davis, CA, USA
- BASF, West Sacramento, CA, USA
- Adam M. Novak
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Allison Regier
- McDonnell Genome Institute, Washington University in St Louis, St Louis, MO, 63108, USA
- Fritz J. Sedlazeck
- Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jouni Siren
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Yassine Souilmi
- Australian Centre for Ancient DNA, School of Biological Sciences, Environment Institute, The University of Adelaide, Adelaide, South Australia, 5005, Australia
- Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
- Travis Wrightsman
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, 14853, USA
- Toshiyuki T. Yokoyama
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
- Qiandong Zeng
- Laboratory Corporation of America Holdings, Westborough, MA, 01581, USA
- Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
- Benedict Paten
- Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA
- Ben Busby
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
|
45
|
Llamas B, Narzisi G, Schneider V, Audano PA, Biederstedt E, Blauvelt L, Bradbury P, Chang X, Chin CS, Fungtammasan A, Clarke WE, Cleary A, Ebler J, Eizenga J, Sibbesen JA, Markello CJ, Garrison E, Garg S, Hickey G, Lazo GR, Lin MF, Mahmoud M, Marschall T, Minkin I, Monlong J, Musunuri RL, Sagayaradj S, Novak AM, Rautiainen M, Regier A, Sedlazeck FJ, Siren J, Souilmi Y, Wagner J, Wrightsman T, Yokoyama TT, Zeng Q, Zook JM, Paten B, Busby B. A strategy for building and using a human reference pangenome. F1000Res 2019; 8:1751. [PMID: 34386196 PMCID: PMC8350888 DOI: 10.12688/f1000research.19630.2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Accepted: 07/23/2021] [Indexed: 11/20/2022] Open
|