1
|
Loh CA, Shields DA, Schwing A, Evrony GD. High-fidelity, large-scale targeted profiling of microsatellites. Genome Res 2024; 34:1008-1026. [PMID: 39013593 PMCID: PMC11368184 DOI: 10.1101/gr.278785.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2023] [Accepted: 07/11/2024] [Indexed: 07/18/2024]
Abstract
Microsatellites are highly mutable sequences that can serve as markers for relationships among individuals or cells within a population. The accuracy and resolution of reconstructing these relationships depends on the fidelity of microsatellite profiling and the number of microsatellites profiled. However, current methods for targeted profiling of microsatellites incur significant "stutter" artifacts that interfere with accurate genotyping, and sequencing costs preclude whole-genome microsatellite profiling of a large number of samples. We developed a novel method for accurate and cost-effective targeted profiling of a panel of more than 150,000 microsatellites per sample, along with a computational tool for designing large-scale microsatellite panels. Our method addresses the greatest challenge for microsatellite profiling-"stutter" artifacts-with a low-temperature hybridization capture that significantly reduces these artifacts. We also developed a computational tool for accurate genotyping of the resulting microsatellite sequencing data that uses an ensemble approach integrating three microsatellite genotyping tools, which we optimize by analysis of de novo microsatellite mutations in human trios. Altogether, our suite of experimental and computational tools enables high-fidelity, large-scale profiling of microsatellites, which may find utility in diverse applications such as lineage tracing, population genetics, ecology, and forensics.
Collapse
Affiliation(s)
- Caitlin A Loh
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, New York 10016, USA
- Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, New York 10016, USA
| | - Danielle A Shields
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, New York 10016, USA
- Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, New York 10016, USA
| | - Adam Schwing
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, New York 10016, USA
- Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, New York 10016, USA
| | - Gilad D Evrony
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, New York 10016, USA;
- Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, New York 10016, USA
| |
Collapse
|
2
|
Ye W, Li JS, Li W, Cui Y. Advancements and future perspectives of human tandem repeats. Sci Bull (Beijing) 2024:S2095-9273(24)00566-8. [PMID: 39164144 DOI: 10.1016/j.scib.2024.08.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/22/2024]
Affiliation(s)
- Wenbin Ye
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
| | - Jason Sheng Li
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA
| | - Wei Li
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
| | - Ya Cui
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
| |
Collapse
|
3
|
Sehgal A, Ziaei Jam H, Shen A, Gymrek M. Genome-wide detection of somatic mosaicism at short tandem repeats. Bioinformatics 2024; 40:btae485. [PMID: 39078205 PMCID: PMC11319640 DOI: 10.1093/bioinformatics/btae485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 06/30/2024] [Accepted: 07/29/2024] [Indexed: 07/31/2024] Open
Abstract
MOTIVATION Somatic mosaicism has been implicated in several developmental disorders, cancers, and other diseases. Short tandem repeats (STRs) consist of repeated sequences of 1-6 bp and comprise >1 million loci in the human genome. Somatic mosaicism at STRs is known to play a key role in the pathogenicity of loci implicated in repeat expansion disorders and is highly prevalent in cancers exhibiting microsatellite instability. While a variety of tools have been developed to genotype germline variation at STRs, a method for systematically identifying mosaic STRs is lacking. RESULTS We introduce prancSTR, a novel method for detecting mosaic STRs from individual high-throughput sequencing datasets. prancSTR is designed to detect loci characterized by a single high-frequency mosaic allele, but can also detect loci with multiple mosaic alleles. Unlike many existing mosaicism detection methods for other variant types, prancSTR does not require a matched control sample as input. We show that prancSTR accurately identifies mosaic STRs in simulated data, demonstrate its feasibility by identifying candidate mosaic STRs in Illumina whole genome sequencing data derived from lymphoblastoid cell lines for individuals sequenced by the 1000 Genomes Project, and evaluate the use of prancSTR on Element and PacBio data. In addition to prancSTR, we present simTR, a novel simulation framework which simulates raw sequencing reads with realistic error profiles at STRs. AVAILABILITY AND IMPLEMENTATION prancSTR and simTR are freely available at https://github.com/gymrek-lab/trtools. Detailed documentation is available at https://trtools.readthedocs.io/.
Collapse
Affiliation(s)
- Aarushi Sehgal
- Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States
| | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States
| | - Andrew Shen
- Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States
- Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States
| |
Collapse
|
4
|
Ziaei Jam H, Zook JM, Javadzadeh S, Park J, Sehgal A, Gymrek M. LongTR: genome-wide profiling of genetic variation at tandem repeats from long reads. Genome Biol 2024; 25:176. [PMID: 38965568 PMCID: PMC11229021 DOI: 10.1186/s13059-024-03319-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2024] [Accepted: 06/21/2024] [Indexed: 07/06/2024] Open
Abstract
Tandem repeats are frequent across the human genome, and variation in repeat length has been linked to a variety of traits. Recent improvements in long read sequencing technologies have the potential to greatly improve tandem repeat analysis, especially for long or complex repeats. Here, we introduce LongTR, which accurately genotypes tandem repeats from high-fidelity long reads available from both PacBio and Oxford Nanopore Technologies. LongTR is freely available at https://github.com/gymrek-lab/longtr and https://zenodo.org/doi/10.5281/zenodo.11403979 .
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Aarushi Sehgal
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
5
|
Lamkin M, Gymrek M. The emerging role of tandem repeats in complex traits. Nat Rev Genet 2024; 25:452-453. [PMID: 38714860 DOI: 10.1038/s41576-024-00736-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]
Affiliation(s)
- Michael Lamkin
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
6
|
Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. Sequencing and characterizing short tandem repeats in the human genome. Nat Rev Genet 2024; 25:460-475. [PMID: 38366034 DOI: 10.1038/s41576-024-00692-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/06/2023] [Indexed: 02/18/2024]
Abstract
Short tandem repeats (STRs) are highly polymorphic sequences throughout the human genome that are composed of repeated copies of a 1-6-bp motif. Over 1 million variable STR loci are known, some of which regulate gene expression and influence complex traits, such as height. Moreover, variants in at least 60 STR loci cause genetic disorders, including Huntington disease and fragile X syndrome. Accurately identifying and genotyping STR variants is challenging, in particular mapping short reads to repetitive regions and inferring expanded repeat lengths. Recent advances in sequencing technology and computational tools for STR genotyping from sequencing data promise to help overcome this challenge and solve genetically unresolved cases and the 'missing heritability' of polygenic traits. Here, we compare STR genotyping methods, analytical tools and their applications to understand the effect of STR variation on health and disease. We identify emergent opportunities to refine genotyping and quality-control approaches as well as to integrate STRs into variant-calling workflows and large cohort analyses.
Collapse
Affiliation(s)
- Hope A Tanudisastro
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia
- Faculty of Medicine and Health, University of Sydney, Sydney, New South Wales, Australia
| | - Ira W Deveson
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
| | - Harriet Dashnow
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA.
| | - Daniel G MacArthur
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia.
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia.
| |
Collapse
|
7
|
Chiu R, Rajan-Babu IS, Friedman JM, Birol I. A comprehensive tandem repeat catalog of the human genome. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.06.19.24309173. [PMID: 38947075 PMCID: PMC11213036 DOI: 10.1101/2024.06.19.24309173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/02/2024]
Abstract
With the increasing availability of long-read sequencing data, high-quality human genome assemblies, and software for fully characterizing tandem repeats, genome-wide genotyping of tandem repeat loci on a population scale becomes more feasible. Such efforts not only expand our knowledge of the tandem repeat landscape in the human genome but also enhance our ability to differentiate pathogenic tandem repeat mutations from benign polymorphisms. To this end, we analyzed 272 genomes assembled using datasets from three public initiatives that employed different long-read sequencing technologies. Here, we report a catalog of over 18 million tandem repeat loci, many of which were previously unannotated. Some of these loci are highly polymorphic, and many of them reside within coding sequences.
Collapse
Affiliation(s)
- Readman Chiu
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Indhu-Shree Rajan-Babu
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
| | - Jan M Friedman
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
- BC Children's Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
| |
Collapse
|
8
|
Huang Y, Wang M, Liu C, He G. Comprehensive landscape of non-CODIS STRs in global populations provides new insights into challenging DNA profiles. Forensic Sci Int Genet 2024; 70:103010. [PMID: 38271830 DOI: 10.1016/j.fsigen.2024.103010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Revised: 01/13/2024] [Accepted: 01/14/2024] [Indexed: 01/27/2024]
Abstract
The worldwide implementation of short tandem repeats (STR) profiles in forensic genetics necessitated establishing and expanding the CODIS core loci set to facilitated efficient data management and exchange. Currently, the mainstay CODIS STRs are adopted in most general-purpose forensic kits. However, relying solely on these loci failed to yield satisfactory results for challenging tasks, such as bio-geographical ancestry inference, complex DNA mixture profile interpretation, and distant kinship analysis. In this context, non-CODIS STRs are potent supplements to enhance the systematic discriminating power, particularly when combined with the high-throughput next-generation sequencing (NGS) technique. Nevertheless, comprehensive evaluation on non-CODIS STRs in diverse populations was scarce, hindering their further application in routine caseworks. To address this gap, we investigated genetic variations of 178 historically available non-CODIS STRs from ethnolinguistically different worldwide populations and studied their characteristics and forensic potentials via high-coverage whole genome sequencing (WGS) data. Initially, we delineated the genomic properties of these non-CODIS markers through sequence searching, repeat structure scanning, and manual inspection. Subsequent population genetics analysis suggested that these non-CODIS STRs had comparable polymorphism levels and forensic utility to CODIS STRs. Furthermore, we constructed a theoretical next-generation sequencing (NGS) panel comprising 108 STRs (20 CODIS STRs and 88 non-CODIS STRs), and evaluated its performance in inferring bio-geographical ancestry origins, deconvoluting complex DNA mixtures, and differentiating distant kinships using real and simulated datasets. Our findings demonstrated that incorporating supplementary non-CODIS STRs enabled the extrapolation of multidimensional information from a single STR profile, thereby facilitating the analysis of challenging forensic tasks. In conclusion, this study presents an extensive genomic landscape of forensic non-CODIS STRs among global populations, and emphasized the imperative inclusion of additional polymorphic non-CODIS STRs in future NGS-based forensic systems.
Collapse
Affiliation(s)
- Yuguo Huang
- Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610041, China.
| | - Mengge Wang
- Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610041, China
| | - Chao Liu
- Anti-Drug Technology Center of Guangdong Province, Guangzhou 510230, China; Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China.
| | - Guanglin He
- Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610041, China; Center for Archaeological Science, Sichuan University, Chengdu 610000, China.
| |
Collapse
|
9
|
Leitão E, Schröder C, Depienne C. Identification and characterization of repeat expansions in neurological disorders: Methodologies, tools, and strategies. Rev Neurol (Paris) 2024; 180:383-392. [PMID: 38594146 DOI: 10.1016/j.neurol.2024.03.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Revised: 03/25/2024] [Accepted: 03/26/2024] [Indexed: 04/11/2024]
Abstract
Tandem repeats are a common, highly polymorphic class of variation in human genomes. Their expansion beyond a pathogenic threshold is a process that contributes to a wide range of neurological and neuromuscular genetic disorders, of which over 60 have been identified to date. The last few years have seen a resurgence in repeat expansion discovery propelled by technological advancements, enabling the identification of over 20 novel repeat expansion disorders. These expansions can occur in coding or non-coding regions of genes, resulting in a range of pathogenic mechanisms. In this article, we review strategies, tools and methods that can be used for efficient detection and characterization of known repeat expansions and identification of new expansion disorders. Features that can be used to prioritize repeat expansions include anticipation, which is characterized by increased severity or earlier onset of symptoms across generations, and founder effects, which contribute to higher prevalence rates in certain populations. Classical technologies such as Southern blotting, repeat-primed polymerase chain reaction (PCR) and long-range PCR can still be used to detect known repeat expansions, although they usually have significant limitations linked to the absence of sequence context. Targeted sequencing of known expansions using either long-range PCR or CRISPR-Cas9 enrichment combined with long-read sequencing or adaptive nanopore sampling are usually better but more expensive alternatives. The development of new bioinformatics tools applied to short-read genome data can now be used to detect repeat expansions either in a targeted manner or at the genome-wide level. In addition, technological advances, particularly long-read technologies such as optical genome mapping (Bionano Genomics), Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi sequencing, offer promising avenues for the detection of repeat expansions. Despite challenges in specific DNA extraction requirements, computation resources needed and data interpretation, these technologies have an immense potential to advance our understanding of repeat expansion disorders and improve diagnostic accuracy.
Collapse
Affiliation(s)
- E Leitão
- Institute of Human Genetics, University Hospital Essen, University Duisburg-Essen, Essen, Germany
| | - C Schröder
- Institute of Human Genetics, University Hospital Essen, University Duisburg-Essen, Essen, Germany
| | - C Depienne
- Institute of Human Genetics, University Hospital Essen, University Duisburg-Essen, Essen, Germany.
| |
Collapse
|
10
|
English AC, Dolzhenko E, Ziaei Jam H, McKenzie SK, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol 2024:10.1038/s41587-024-02225-z. [PMID: 38671154 DOI: 10.1038/s41587-024-02225-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2023] [Accepted: 03/28/2024] [Indexed: 04/28/2024]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Collapse
Affiliation(s)
- Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| | | | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | | | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Bida Gu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
11
|
Cui Y, Ye W, Li JS, Li JJ, Vilain E, Sallam T, Li W. A genome-wide spectrum of tandem repeat expansions in 338,963 humans. Cell 2024; 187:2336-2341.e5. [PMID: 38582080 PMCID: PMC11065452 DOI: 10.1016/j.cell.2024.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2023] [Revised: 01/23/2024] [Accepted: 03/05/2024] [Indexed: 04/08/2024]
Abstract
The Genome Aggregation Database (gnomAD), widely recognized as the gold-standard reference map of human genetic variation, has largely overlooked tandem repeat (TR) expansions, despite the fact that TRs constitute ∼6% of our genome and are linked to over 50 human diseases. Here, we introduce the TR-gnomAD (https://wlcb.oit.uci.edu/TRgnomAD), a biobank-scale reference of 0.86 million TRs derived from 338,963 whole-genome sequencing (WGS) samples of diverse ancestries (39.5% non-European samples). TR-gnomAD offers critical insights into ancestry-specific disease prevalence using disparities in TR unit number frequencies among ancestries. Moreover, TR-gnomAD is able to differentiate between common, presumably benign TR expansions, which are prevalent in TR-gnomAD, from those potentially pathogenic TR expansions, which are found more frequently in disease groups than within TR-gnomAD. Together, TR-gnomAD is an invaluable resource for researchers and physicians to interpret TR expansions in individuals with genetic diseases.
Collapse
Affiliation(s)
- Ya Cui
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
| | - Wenbin Ye
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA
| | - Jason Sheng Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Eric Vilain
- Institute for Clinical and Translational Science, University of California, Irvine, Irvine, CA 92697, USA; Department of Pediatrics, University of California, Irvine, Irvine, CA 92697, USA
| | - Tamer Sallam
- Division of Cardiology, Department of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
| |
Collapse
|
12
|
Fazzari V, Moo-Choy A, Panoyan MA, Abbatangelo CL, Polimanti R, Novroski NM, Wendt FR. Multi-ancestry tandem repeat association study of hair colour using exome-wide sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.24.581865. [PMID: 38464141 PMCID: PMC10925195 DOI: 10.1101/2024.02.24.581865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/12/2024]
Abstract
Hair colour variation is influenced by hundreds of positions across the human genome but this genetic contribution has only been narrowly explored. Genome-wide association studies identified single nucleotide polymorphisms (SNPs) influencing hair colour but the biology underlying these associations is challenging to interpret. We report 16 tandem repeats (TRs) with effects on different models of hair colour plus two TRs associated with hair colour in diverse ancestry groups. Several of these TRs expand or contract amino acid coding regions of their localized protein such that structure, and by extension function, may be altered. We also demonstrate that independent of SNP variation, these TRs can be used to great an additive polygenic score that predicts darker hair colour. This work adds to the growing body of evidence regarding TR influence on human traits with relatively large and independent effects relative to surrounding SNP variation.
Collapse
|
13
|
Jam HZ, Zook JM, Javadzadeh S, Park J, Sehgal A, Gymrek M. Genome-wide profiling of genetic variation at tandem repeat from long reads. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.20.576266. [PMID: 38328152 PMCID: PMC10849534 DOI: 10.1101/2024.01.20.576266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/09/2024]
Abstract
Tandem repeats are frequent across the human genome, and variation in repeat length has been linked to a variety of traits. Recent improvements in long read sequencing technologies have the potential to greatly improve TR analysis, especially for long or complex repeats. Here we introduce LongTR, which accurately genotypes tandem repeats from high fidelity long reads available from both PacBio and Oxford Nanopore Technologies. LongTR is freely available at https://github.com/gymrek-lab/longtr.
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr., Gaithersburg, MD, USA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Aarushi Sehgal
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| |
Collapse
|
14
|
Dolzhenko E, English A, Dashnow H, De Sena Brandine G, Mokveld T, Rowell WJ, Karniski C, Kronenberg Z, Danzi MC, Cheung WA, Bi C, Farrow E, Wenger A, Chua KP, Martínez-Cerdeño V, Bartley TD, Jin P, Nelson DL, Zuchner S, Pastinen T, Quinlan AR, Sedlazeck FJ, Eberle MA. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol 2024:10.1038/s41587-023-02057-3. [PMID: 38168995 DOI: 10.1038/s41587-023-02057-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2023] [Accepted: 11/06/2023] [Indexed: 01/05/2024]
Abstract
Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
Collapse
Affiliation(s)
| | - Adam English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Harriet Dashnow
- Departments of Human Genetics and Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
| | | | - Tom Mokveld
- Pacific Biosciences of California, Menlo Park, CA, USA
| | | | | | | | - Matt C Danzi
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Warren A Cheung
- Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
| | - Chengpeng Bi
- Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
| | - Emily Farrow
- Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
| | - Aaron Wenger
- Pacific Biosciences of California, Menlo Park, CA, USA
| | - Khi Pin Chua
- Pacific Biosciences of California, Menlo Park, CA, USA
| | - Verónica Martínez-Cerdeño
- Institute for Pediatric Regenerative Medicine, Shriner's Hospital for Children and UC Davis School of Medicine, Sacramento, CA, USA
- Department of Pathology & Laboratory Medicine, UC Davis School of Medicine, Sacramento, CA, USA
- MIND Institute, UC Davis School of Medicine, Sacramento, CA, USA
| | - Trevor D Bartley
- Institute for Pediatric Regenerative Medicine, Shriner's Hospital for Children and UC Davis School of Medicine, Sacramento, CA, USA
- Department of Pathology & Laboratory Medicine, UC Davis School of Medicine, Sacramento, CA, USA
| | - Peng Jin
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
| | - David L Nelson
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Stephan Zuchner
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Tomi Pastinen
- Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
| | - Aaron R Quinlan
- Departments of Human Genetics and Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | | |
Collapse
|
15
|
Loh PR. Uncovering complex trait heritability hidden in the repeatome. CELL GENOMICS 2023; 3:100461. [PMID: 38116125 PMCID: PMC10726486 DOI: 10.1016/j.xgen.2023.100461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/21/2023]
Abstract
Short tandem repeats (STRs) account for a substantial fraction of human genetic variation, but their contribution to complex human phenotypes is largely unknown. Margoliash et al. perform detailed genome-wide association analysis and fine-mapping of STRs in UK Biobank, identifying many STRs likely to influence variation in blood and serum traits.
Collapse
Affiliation(s)
- Po-Ru Loh
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA; Center for Data Sciences, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| |
Collapse
|
16
|
Hannan AJ. Expanding horizons of tandem repeats in biology and medicine: Why 'genomic dark matter' matters. Emerg Top Life Sci 2023; 7:ETLS20230075. [PMID: 38088823 PMCID: PMC10754335 DOI: 10.1042/etls20230075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Revised: 11/27/2023] [Accepted: 11/27/2023] [Indexed: 12/30/2023]
Abstract
Approximately half of the human genome includes repetitive sequences, and these DNA sequences (as well as their transcribed repetitive RNA and translated amino-acid repeat sequences) are known as the repeatome. Within this repeatome there are a couple of million tandem repeats, dispersed throughout the genome. These tandem repeats have been estimated to constitute ∼8% of the entire human genome. These tandem repeats can be located throughout exons, introns and intergenic regions, thus potentially affecting the structure and function of tandemly repetitive DNA, RNA and protein sequences. Over more than three decades, more than 60 monogenic human disorders have been found to be caused by tandem-repeat mutations. These monogenic tandem-repeat disorders include Huntington's disease, a variety of ataxias, amyotrophic lateral sclerosis and frontotemporal dementia, as well as many other neurodegenerative diseases. Furthermore, tandem-repeat disorders can include fragile X syndrome, related fragile X disorders, as well as other neurological and psychiatric disorders. However, these monogenic tandem-repeat disorders, which were discovered via their dominant or recessive modes of inheritance, may represent the 'tip of the iceberg' with respect to tandem-repeat contributions to human disorders. A previous proposal that tandem repeats may contribute to the 'missing heritability' of various common polygenic human disorders has recently been supported by a variety of new evidence. This includes genome-wide studies that associate tandem-repeat mutations with autism, schizophrenia, Parkinson's disease and various types of cancers. In this article, I will discuss how tandem-repeat mutations and polymorphisms could contribute to a wide range of common disorders, along with some of the many major challenges of tandem-repeat biology and medicine. Finally, I will discuss the potential of tandem repeats to be therapeutically targeted, so as to prevent and treat an expanding range of human disorders.
Collapse
Affiliation(s)
- Anthony J Hannan
- Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, Victoria 3010, Australia
- Department of Anatomy and Physiology, University of Melbourne, Parkville, Victoria 3010, Australia
| |
Collapse
|
17
|
Panoyan MA, Shi Y, Abbatangelo CL, Adler N, Moo-Choy A, Parra EJ, Polimanti R, Hu P, Wendt FR. Exome-wide tandem repeats confer large effects on subcortical volumes in UK Biobank participants. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.12.11.23299818. [PMID: 38168307 PMCID: PMC10760277 DOI: 10.1101/2023.12.11.23299818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/05/2024]
Abstract
The human subcortex is involved in memory and cognition. Structural and functional changes in subcortical regions is implicated in psychiatric conditions. We performed an association study of subcortical volumes using 15,941 tandem repeats (TRs) derived from whole exome sequencing (WES) data in 16,527 unrelated European ancestry participants. We identified 17 loci, most of which were associated with accumbens volume, and nine of which had fine-mapping probability supporting their causal effect on subcortical volume independent of surrounding variation. The most significant association involved NTN1 -[GCGG] N and increased accumbens volume (β=5.93, P=8.16x10 -9 ). Three exonic TRs had large effects on thalamus volume ( LAT2 -[CATC] N β=-949, P=3.84x10 -6 and SLC39A4 -[CAG] N β=-1599, P=2.42x10 -8 ) and pallidum volume ( MCM2 -[AGG] N β=-404.9, P=147x10 -7 ). These genetic effects were consistent measurements of per-repeat expansion/contraction effects on organism fitness. With 3-dimensional modeling, we reinforced these effects to show that the expanded and contracted LAT2 -[CATC] N repeat causes a frameshift mutation that prevents appropriate protein folding. These TRs also exhibited independent effects on several psychiatric symptoms, including LAT2 -[CATC] N and the tiredness/low energy symptom of depression (β=0.340, P=0.003). These findings link genetic variation to tractable biology in the brain and relevant psychiatric symptoms. We also chart one pathway for TR prioritization in future complex trait genetic studies.
Collapse
|
18
|
Sehgal A, Ziaei-Jam H, Shen A, Gymrek M. Genome-wide detection of somatic mosaicism at short tandem repeats. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.22.568371. [PMID: 38045311 PMCID: PMC10690266 DOI: 10.1101/2023.11.22.568371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/05/2023]
Abstract
Motivation Somatic mosaicism, in which a mutation occurs post-zygotically, has been implicated in several developmental disorders, cancers, and other diseases. Short tandem repeats (STRs) consist of repeated sequences of 1-6bp and comprise more than 1 million loci in the human genome. Somatic mosaicism at STRs is known to play a key role in the pathogenicity of loci implicated in repeat expansion disorders and is highly prevalent in cancers exhibiting microsatellite instability. While a variety of tools have been developed to genotype germline variation at STRs, a method for systematically identifying mosaic STRs (mSTRs) is lacking. Results We introduce prancSTR, a novel method for detecting mSTRs from individual high-throughput sequencing datasets. Unlike many existing mosaicism detection methods for other variant types, prancSTR does not require a matched control sample as input. We show that prancSTR accurately identifies mSTRs in simulated data and demonstrate its feasibility by identifying candidate mSTRs in whole genome sequencing (WGS) data derived from lymphoblastoid cell lines for individuals sequenced by the 1000 Genomes Project. Our analysis identified an average of 76 and 577 non-homopolymer and homopolymer mSTRs respectively per cell line as well as multiple cell lines with outlier mSTR counts more than 6 times the population average, suggesting a subset of cell lines have particularly high STR instability rates. Availability prancSTR is freely available at https://github.com/gymrek-lab/trtools. Documentation Detailed documentation is available at https://trtools.readthedocs.io/.
Collapse
Affiliation(s)
- Aarushi Sehgal
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, USA
| | - Helyaneh Ziaei-Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, USA
| | - Andrew Shen
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, USA
- Department of Medicine, University of California San Diego, La Jolla, USA
| |
Collapse
|