1
|
Ziaei Jam H, Li Y, DeVito R, Mousavi N, Ma N, Lujumba I, Adam Y, Maksimov M, Huang B, Dolzhenko E, Qiu Y, Kakembo FE, Joseph H, Onyido B, Adeyemi J, Bakhtiari M, Park J, Javadzadeh S, Jjingo D, Adebiyi E, Bafna V, Gymrek M. A deep population reference panel of tandem repeat variation. Nat Commun 2023; 14:6711. [PMID: 37872149 PMCID: PMC10593948 DOI: 10.1038/s41467-023-42278-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2023] [Accepted: 10/05/2023] [Indexed: 10/25/2023] Open
Abstract
Tandem repeats (TRs) represent one of the largest sources of genetic variation in humans and are implicated in a range of phenotypes. Here we present a deep characterization of TR variation based on high coverage whole genome sequencing from 3550 diverse individuals from the 1000 Genomes Project and H3Africa cohorts. We develop a method, EnsembleTR, to integrate genotypes from four separate methods resulting in high-quality genotypes at more than 1.7 million TR loci. Our catalog reveals novel sequence features influencing TR heterozygosity, identifies population-specific trinucleotide expansions, and finds hundreds of novel eQTL signals. Finally, we generate a phased haplotype panel which can be used to impute most TRs from nearby single nucleotide polymorphisms (SNPs) with high accuracy. Overall, the TR genotypes and reference haplotype panel generated here will serve as valuable resources for future genome-wide and population-wide studies of TRs and their role in human phenotypes.
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Yang Li
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Ross DeVito
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Nima Mousavi
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Nichole Ma
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Ibra Lujumba
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Yagoub Adam
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mikhail Maksimov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Bonnie Huang
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
| | | | - Yunjiang Qiu
- Illumina Incorporated, San Diego, CA, 92122, USA
| | - Fredrick Elishama Kakembo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Habi Joseph
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Blessing Onyido
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Jumoke Adeyemi
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Daudi Jjingo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
- Department of Computer Science, Makerere University, Kampala, Uganda
| | - Ezekiel Adebiyi
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
- Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, Baden-Württemberg, 69120, Germany
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
2
|
Jam HZ, Li Y, DeVito R, Mousavi N, Ma N, Lujumba I, Adam Y, Maksimov M, Huang B, Dolzhenko E, Qiu Y, Kakembo FE, Joseph H, Onyido B, Adeyemi J, Bakhtiari M, Park J, Javadzadeh S, Jjingo D, Adebiyi E, Bafna V, Gymrek M. A deep population reference panel of tandem repeat variation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.09.531600. [PMID: 36945429 PMCID: PMC10028971 DOI: 10.1101/2023.03.09.531600] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/14/2023]
Abstract
Tandem repeats (TRs) represent one of the largest sources of genetic variation in humans and are implicated in a range of phenotypes. Here we present a deep characterization of TR variation based on high coverage whole genome sequencing from 3,550 diverse individuals from the 1000 Genomes Project and H3Africa cohorts. We develop a method, EnsembleTR, to integrate genotypes from four separate methods resulting in high-quality genotypes at more than 1.7 million TR loci. Our catalog reveals novel sequence features influencing TR heterozygosity, identifies population-specific trinucleotide expansions, and finds hundreds of novel eQTL signals. Finally, we generate a phased haplotype panel which can be used to impute most TRs from nearby single nucleotide polymorphisms (SNPs) with high accuracy. Overall, the TR genotypes and reference haplotype panel generated here will serve as valuable resources for future genome-wide and population-wide studies of TRs and their role in human phenotypes.
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Yang Li
- Department of Medicine, University of California San Diego, La Jolla, CA
| | - Ross DeVito
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Nima Mousavi
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA
| | - Nichole Ma
- Department of Medicine, University of California San Diego, La Jolla, CA
| | - Ibra Lujumba
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Yagoub Adam
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mikhail Maksimov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Bonnie Huang
- Department of Bioengineering, University of California San Diego, La Jolla, CA
| | | | - Yunjiang Qiu
- Illumina Incorporated, San Diego, California 92122, USA
| | - Fredrick Elishama Kakembo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Habi Joseph
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
| | - Blessing Onyido
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Jumoke Adeyemi
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Daudi Jjingo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala-Uganda
- Department of Computer Science, Makerere University, Kampala, Uganda
| | - Ezekiel Adebiyi
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
- Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, Baden-Württemberg, 69120, Germany
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA
- Department of Medicine, University of California San Diego, La Jolla, CA
| |
Collapse
|
3
|
Fazal S, Danzi MC, Cintra VP, Bis-Brewer DM, Dolzhenko E, Eberle MA, Zuchner S. Large scale in silico characterization of repeat expansion variation in human genomes. Sci Data 2020; 7:294. [PMID: 32901039 PMCID: PMC7479135 DOI: 10.1038/s41597-020-00633-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 08/13/2020] [Indexed: 11/21/2022] Open
Abstract
Significant progress has been made in elucidating single nucleotide polymorphism diversity in the human population. However, the majority of the variation space in the genome is structural and remains partially elusive. One form of structural variation is tandem repeats (TRs). Expansion of TRs are responsible for over 40 diseases, but we hypothesize these represent only a fraction of the pathogenic repeat expansions that exist. Here we characterize long or expanded TR variation in 1,115 human genomes as well as a replication cohort of 2,504 genomes, identified using ExpansionHunter Denovo. We found that individual genomes typically harbor several rare, large TRs, generally in non-coding regions of the genome. We noticed that these large TRs are enriched in their proximity to Alu elements. The vast majority of these large TRs seem to be expansions of smaller TRs that are already present in the reference genome. We are providing this TR profile as a resource for comparison to undiagnosed rare disease genomes in order to detect novel disease-causing repeat expansions.
Collapse
Affiliation(s)
- Sarah Fazal
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Matt C Danzi
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Vivian P Cintra
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Dana M Bis-Brewer
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
| | | | | | - Stephan Zuchner
- Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA.
| |
Collapse
|
4
|
Prentice MB, Bowman J, Lalor JL, McKay MM, Thomson LA, Watt CM, McAdam AG, Murray DL, Wilson PJ. Signatures of selection in mammalian clock genes with coding trinucleotide repeats: Implications for studying the genomics of high-pace adaptation. Ecol Evol 2017; 7:7254-7276. [PMID: 28944015 PMCID: PMC5606889 DOI: 10.1002/ece3.3223] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2017] [Revised: 05/31/2017] [Accepted: 06/06/2017] [Indexed: 12/14/2022] Open
Abstract
Climate change is predicted to affect the reproductive ecology of wildlife; however, we have yet to understand if and how species can adapt to the rapid pace of change. Clock genes are functional genes likely critical for adaptation to shifting seasonal conditions through shifts in timing cues. Many of these genes contain coding trinucleotide repeats, which offer the potential for higher rates of change than single nucleotide polymorphisms (SNPs) at coding sites, and, thus, may translate to faster rates of adaptation in changing environments. We characterized repeats in 22 clock genes across all annotated mammal species and evaluated the potential for selection on repeat motifs in three clock genes (NR1D1,CLOCK, and PER1) in three congeneric species pairs with different latitudinal range limits: Canada lynx and bobcat (Lynx canadensis and L. rufus), northern and southern flying squirrels (Glaucomys sabrinus and G. volans), and white‐footed and deer mouse (Peromyscus leucopus and P. maniculatus). Signatures of positive selection were found in both the interspecific comparison of Canada lynx and bobcat, and intraspecific analyses in Canada lynx. Northern and southern flying squirrels showed differing frequencies at common CLOCK alleles and a signature of balancing selection. Regional excess homozygosity was found in the deer mouse at PER1 suggesting disruptive selection, and further analyses suggested balancing selection in the white‐footed mouse. These preliminary signatures of selection and the presence of trinucleotide repeats within many clock genes warrant further consideration of the importance of candidate gene motifs for adaptation to climate change.
Collapse
Affiliation(s)
- Melanie B Prentice
- Department of Environmental and Life Sciences Trent University Peterborough ON Canada
| | - Jeff Bowman
- Wildlife Research and Monitoring Section Ontario Ministry of Natural Resources and Forestry Peterborough ON Canada
| | | | - Michelle M McKay
- Department of Environmental and Life Sciences Trent University Peterborough ON Canada
| | | | - Cristen M Watt
- Department of Environmental and Life Sciences Trent University Peterborough ON Canada
| | - Andrew G McAdam
- Department of Integrative Biology University of Guelph Guelph ON Canada
| | | | - Paul J Wilson
- Biology Department Trent University Peterborough ON Canada
| |
Collapse
|
5
|
Bilgin Sonay T, Carvalho T, Robinson MD, Greminger MP, Krützen M, Comas D, Highnam G, Mittelman D, Sharp A, Marques-Bonet T, Wagner A. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res 2015; 25:1591-1599. [PMID: 26290536 DOI: 10.1101/015784] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2015] [Accepted: 08/14/2015] [Indexed: 05/25/2023]
Abstract
Tandem repeats (TRs) are stretches of DNA that are highly variable in length and mutate rapidly. They are thus an important source of genetic variation. This variation is highly informative for population and conservation genetics. It has also been associated with several pathological conditions and with gene expression regulation. However, genome-wide surveys of TR variation in humans and closely related species have been scarce due to technical difficulties derived from short-read technology. Here we explored the genome-wide diversity of TRs in a panel of 83 human and nonhuman great ape genomes, in a total of six different species, and studied their impact on gene expression evolution. We found that population diversity patterns can be efficiently captured with short TRs (repeat unit length, 1-5 bp). We examined the potential evolutionary role of TRs in gene expression differences between humans and primates by using 30,275 larger TRs (repeat unit length, 2-50 bp). Genes that contained TRs in the promoters, in their 3' untranslated region, in introns, and in exons had higher expression divergence than genes without repeats in the regions. Polymorphic small repeats (1-5 bp) had also higher expression divergence compared with genes with fixed or no TRs in the gene promoters. Our findings highlight the potential contribution of TRs to human evolution through gene regulation.
Collapse
Affiliation(s)
- Tugce Bilgin Sonay
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, CH-805 Zurich, Switzerland; The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Tiago Carvalho
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Mark D Robinson
- The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; Institute of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland
| | - Maja P Greminger
- Evolutionary Genetics Group, Anthropological Institute and Museum, University of Zurich, CH-8057 Zurich, Switzerland
| | - Michael Krützen
- Evolutionary Genetics Group, Anthropological Institute and Museum, University of Zurich, CH-8057 Zurich, Switzerland
| | - David Comas
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Gareth Highnam
- Department of Biological Science and Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061, USA
| | - David Mittelman
- Department of Biological Science and Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061, USA
| | - Andrew Sharp
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai School, New York, New York 10029, USA
| | - Tomàs Marques-Bonet
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain; Centro Nacional de Análisis Genómico (CNAG), PCB, Barcelona, 08028 Catalonia, Spain; Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
| | - Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, CH-805 Zurich, Switzerland; The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; The Santa Fe Institute, Santa Fe, New Mexico 87501, USA
| |
Collapse
|
6
|
Bilgin Sonay T, Carvalho T, Robinson MD, Greminger MP, Krützen M, Comas D, Highnam G, Mittelman D, Sharp A, Marques-Bonet T, Wagner A. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res 2015; 25:1591-9. [PMID: 26290536 PMCID: PMC4617956 DOI: 10.1101/gr.190868.115] [Citation(s) in RCA: 57] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2015] [Accepted: 08/14/2015] [Indexed: 12/20/2022]
Abstract
Tandem repeats (TRs) are stretches of DNA that are highly variable in length and mutate rapidly. They are thus an important source of genetic variation. This variation is highly informative for population and conservation genetics. It has also been associated with several pathological conditions and with gene expression regulation. However, genome-wide surveys of TR variation in humans and closely related species have been scarce due to technical difficulties derived from short-read technology. Here we explored the genome-wide diversity of TRs in a panel of 83 human and nonhuman great ape genomes, in a total of six different species, and studied their impact on gene expression evolution. We found that population diversity patterns can be efficiently captured with short TRs (repeat unit length, 1–5 bp). We examined the potential evolutionary role of TRs in gene expression differences between humans and primates by using 30,275 larger TRs (repeat unit length, 2–50 bp). Genes that contained TRs in the promoters, in their 3′ untranslated region, in introns, and in exons had higher expression divergence than genes without repeats in the regions. Polymorphic small repeats (1–5 bp) had also higher expression divergence compared with genes with fixed or no TRs in the gene promoters. Our findings highlight the potential contribution of TRs to human evolution through gene regulation.
Collapse
Affiliation(s)
- Tugce Bilgin Sonay
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, CH-805 Zurich, Switzerland; The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Tiago Carvalho
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Mark D Robinson
- The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; Institute of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland
| | - Maja P Greminger
- Evolutionary Genetics Group, Anthropological Institute and Museum, University of Zurich, CH-8057 Zurich, Switzerland
| | - Michael Krützen
- Evolutionary Genetics Group, Anthropological Institute and Museum, University of Zurich, CH-8057 Zurich, Switzerland
| | - David Comas
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Gareth Highnam
- Department of Biological Science and Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061, USA
| | - David Mittelman
- Department of Biological Science and Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061, USA
| | - Andrew Sharp
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai School, New York, New York 10029, USA
| | - Tomàs Marques-Bonet
- Institute of Evolutionary Biology (CSIC-UPF), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain; Centro Nacional de Análisis Genómico (CNAG), PCB, Barcelona, 08028 Catalonia, Spain; Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
| | - Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, CH-805 Zurich, Switzerland; The Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; The Santa Fe Institute, Santa Fe, New Mexico 87501, USA
| |
Collapse
|
7
|
Carlson KD, Sudmant PH, Press MO, Eichler EE, Shendure J, Queitsch C. MIPSTR: a method for multiplex genotyping of germline and somatic STR variation across many individuals. Genome Res 2015; 25:750-61. [PMID: 25659649 PMCID: PMC4417122 DOI: 10.1101/gr.182212.114] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2014] [Accepted: 02/05/2015] [Indexed: 12/21/2022]
Abstract
Short tandem repeats (STRs) are highly mutable genetic elements that often reside in regulatory and coding DNA. The cumulative evidence of genetic studies on individual STRs suggests that STR variation profoundly affects phenotype and contributes to trait heritability. Despite recent advances in sequencing technology, STR variation has remained largely inaccessible across many individuals compared to single nucleotide variation or copy number variation. STR genotyping with short-read sequence data is confounded by (1) the difficulty of uniquely mapping short, low-complexity reads; and (2) the high rate of STR amplification stutter. Here, we present MIPSTR, a robust, scalable, and affordable method that addresses these challenges. MIPSTR uses targeted capture of STR loci by single-molecule Molecular Inversion Probes (smMIPs) and a unique mapping strategy. Targeted capture and our mapping strategy resolve the first challenge; the use of single molecule information resolves the second challenge. Unlike previous methods, MIPSTR is capable of distinguishing technical error due to amplification stutter from somatic STR mutations. In proof-of-principle experiments, we use MIPSTR to determine germline STR genotypes for 102 STR loci with high accuracy across diverse populations of the plant A. thaliana. We show that putatively functional STRs may be identified by deviation from predicted STR variation and by association with quantitative phenotypes. Using DNA mixing experiments and a mutant deficient in DNA repair, we demonstrate that MIPSTR can detect low-frequency somatic STR variants. MIPSTR is applicable to any organism with a high-quality reference genome and is scalable to genotyping many thousands of STR loci in thousands of individuals.
Collapse
Affiliation(s)
- Keisha D Carlson
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| | - Peter H Sudmant
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| | - Maximilian O Press
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; Howard Hughes Medical Institute, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| | - Christine Queitsch
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| |
Collapse
|
8
|
Willems T, Gymrek M, Highnam G, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Res 2014; 24:1894-904. [PMID: 25135957 PMCID: PMC4216929 DOI: 10.1101/gr.177774.114] [Citation(s) in RCA: 174] [Impact Index Per Article: 17.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Accepted: 08/15/2014] [Indexed: 02/06/2023]
Abstract
Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across more than 1000 individuals in Phase 1 of the 1000 Genomes Project. Extensive quality controls show that reliable allelic spectra can be obtained for close to 90% of the STR loci in the genome. We utilize this call set to analyze determinants of STR variation, assess the human reference genome's representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations.
Collapse
Affiliation(s)
- Thomas Willems
- Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; Computational and Systems Biology Program, MIT, Cambridge, Massachusetts 02139, USA
| | - Melissa Gymrek
- Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, Massachusetts 02139, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA; Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
| | - Gareth Highnam
- Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia 24061, USA
| | - David Mittelman
- Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia 24061, USA; Gene by Gene, Ltd., Houston, Texas 77008, USA
| | - Yaniv Erlich
- Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA;
| |
Collapse
|
9
|
Kay C, Skotte N, Southwell A, Hayden M. Personalized gene silencing therapeutics for Huntington disease. Clin Genet 2014; 86:29-36. [DOI: 10.1111/cge.12385] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2014] [Revised: 03/17/2014] [Accepted: 03/17/2014] [Indexed: 01/14/2023]
Affiliation(s)
- C. Kay
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute; University of British Columbia; Vancouver Canada
| | - N.H. Skotte
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute; University of British Columbia; Vancouver Canada
| | - A.L. Southwell
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute; University of British Columbia; Vancouver Canada
| | - M.R. Hayden
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute; University of British Columbia; Vancouver Canada
| |
Collapse
|
10
|
Guilmatre A, Highnam G, Borel C, Mittelman D, Sharp AJ. Rapid multiplexed genotyping of simple tandem repeats using capture and high-throughput sequencing. Hum Mutat 2013; 34:1304-11. [PMID: 23696428 DOI: 10.1002/humu.22359] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2013] [Accepted: 05/07/2013] [Indexed: 11/12/2022]
Abstract
Although simple tandem repeats (STRs) comprise ~2% of the human genome and represent an important source of polymorphism, this class of variation remains understudied. We have developed a cost-effective strategy for performing targeted enrichment of STR regions that utilizes capture probes targeting the flanking sequences of STR loci, enabling specific capture of DNA fragments containing STRs for subsequent high-throughput sequencing. Utilizing a capture design targeting 6,243 STR loci <94 bp and multiplexing eight individuals in a single Illumina HiSeq2000 sequencing lane we were able to call genotypes in at least one individual for 67.5% of the targeted STRs. We observed a strong relationship between (G+C) content and genotyping rate. STRs with moderate (G+C) content were recovered with >90% success rate, whereas only 12% of STRs with ≥ 80% (G+C) were genotyped in our assay. Analysis of a parent-offspring trio, complete hydatidiform mole samples, repeat analyses of the same individual, and Sanger sequencing-based validation indicated genotyping error rates between 7.6% and 12.4%. The majority of such errors were a single repeat unit at mono- or dinucleotide repeats. Altogether, our STR capture assay represents a cost-effective method that enables multiplexed genotyping of thousands of STR loci suitable for large-scale population studies.
Collapse
Affiliation(s)
- Audrey Guilmatre
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA
| | | | | | | | | |
Collapse
|
11
|
Background-dependent effects of polyglutamine variation in the Arabidopsis thaliana gene ELF3. Proc Natl Acad Sci U S A 2012; 109:19363-7. [PMID: 23129635 DOI: 10.1073/pnas.1211021109] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Tandem repeats (TRs) have extremely high mutation rates and are often considered to be neutrally evolving DNA. However, in coding regions, TR copy number mutations can significantly affect phenotype and may facilitate rapid adaptation to new environments. In several human genes, TR copy number mutations that expand polyglutamine (polyQ) tracts beyond a certain threshold cause incurable neurodegenerative diseases. PolyQ-containing proteins exist at a considerable frequency in eukaryotes, yet the phenotypic consequences of natural variation in polyQ tracts that are not associated with disease remain largely unknown. Here, we use Arabidopsis thaliana to dissect the phenotypic consequences of natural variation in the polyQ tract encoded by EARLY FLOWERING 3 (ELF3), a key developmental gene. Changing ELF3 polyQ tract length affected complex ELF3-dependent phenotypes in a striking and nonlinear manner. Some natural ELF3 polyQ variants phenocopied elf3 loss-of-function mutants in a common reference background, although they are functional in their native genetic backgrounds. To test the existence of background-specific modifiers, we compared the phenotypic effects of ELF3 polyQ variants between two divergent backgrounds, Col and Ws, and found dramatic differences. In fact, the Col-ELF3 allele, encoding the shortest known ELF3 polyQ tract, was haploinsufficient in Ws × Col F(1) hybrids. Our data support a model in which variable polyQ tracts drive adaptation to internal genetic environments.
Collapse
|
12
|
|
13
|
Abstract
The history of genetic markers accurately partitions the progression of molecular genetics into three phases: the RFLP (restriction fragment length polymorphism), microsatellite and SNP (single nucleotide polymorphism) eras. This chapter focuses predominately on the current workhorse, the SNP, though briefly covers the former two and overviews current online databases and portals that act as central repositories as well as hubs to further detailed information. Central gene or disease-based searches are considered and then followed through systematically.
Collapse
|
14
|
Chen M, Tan Z, Zeng G. Microsatellite is an important component of complete Hepatitis C virus genomes. INFECTION GENETICS AND EVOLUTION 2011; 11:1646-54. [DOI: 10.1016/j.meegid.2011.06.012] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/11/2011] [Revised: 06/02/2011] [Accepted: 06/16/2011] [Indexed: 12/15/2022]
|
15
|
Cooper DN, Bacolla A, Férec C, Vasquez KM, Kehrer-Sawatzki H, Chen JM. On the sequence-directed nature of human gene mutation: the role of genomic architecture and the local DNA sequence environment in mediating gene mutations underlying human inherited disease. Hum Mutat 2011; 32:1075-99. [PMID: 21853507 PMCID: PMC3177966 DOI: 10.1002/humu.21557] [Citation(s) in RCA: 94] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2011] [Accepted: 06/17/2011] [Indexed: 12/21/2022]
Abstract
Different types of human gene mutation may vary in size, from structural variants (SVs) to single base-pair substitutions, but what they all have in common is that their nature, size and location are often determined either by specific characteristics of the local DNA sequence environment or by higher order features of the genomic architecture. The human genome is now recognized to contain "pervasive architectural flaws" in that certain DNA sequences are inherently mutation prone by virtue of their base composition, sequence repetitivity and/or epigenetic modification. Here, we explore how the nature, location and frequency of different types of mutation causing inherited disease are shaped in large part, and often in remarkably predictable ways, by the local DNA sequence environment. The mutability of a given gene or genomic region may also be influenced indirectly by a variety of noncanonical (non-B) secondary structures whose formation is facilitated by the underlying DNA sequence. Since these non-B DNA structures can interfere with subsequent DNA replication and repair and may serve to increase mutation frequencies in generalized fashion (i.e., both in the context of subtle mutations and SVs), they have the potential to serve as a unifying concept in studies of mutational mechanisms underlying human inherited disease.
Collapse
Affiliation(s)
- David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, United Kingdom.
| | | | | | | | | | | |
Collapse
|
16
|
Krzyzosiak WJ, Sobczak K, Wojciechowska M, Fiszer A, Mykowska A, Kozlowski P. Triplet repeat RNA structure and its role as pathogenic agent and therapeutic target. Nucleic Acids Res 2011; 40:11-26. [PMID: 21908410 PMCID: PMC3245940 DOI: 10.1093/nar/gkr729] [Citation(s) in RCA: 122] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
This review presents detailed information about the structure of triplet repeat RNA and addresses the simple sequence repeats of normal and expanded lengths in the context of the physiological and pathogenic roles played in human cells. First, we discuss the occurrence and frequency of various trinucleotide repeats in transcripts and classify them according to the propensity to form RNA structures of different architectures and stabilities. We show that repeats capable of forming hairpin structures are overrepresented in exons, which implies that they may have important functions. We further describe long triplet repeat RNA as a pathogenic agent by presenting human neurological diseases caused by triplet repeat expansions in which mutant RNA gains a toxic function. Prominent examples of these diseases include myotonic dystrophy type 1 and fragile X-associated tremor ataxia syndrome, which are triggered by mutant CUG and CGG repeats, respectively. In addition, we discuss RNA-mediated pathogenesis in polyglutamine disorders such as Huntington's disease and spinocerebellar ataxia type 3, in which expanded CAG repeats may act as an auxiliary toxic agent. Finally, triplet repeat RNA is presented as a therapeutic target. We describe various concepts and approaches aimed at the selective inhibition of mutant transcript activity in experimental therapies developed for repeat-associated diseases.
Collapse
Affiliation(s)
- Wlodzimierz J Krzyzosiak
- Laboratory of Cancer Genetics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
| | | | | | | | | | | |
Collapse
|
17
|
Evers MM, Pepers BA, van Deutekom JCT, Mulders SAM, den Dunnen JT, Aartsma-Rus A, van Ommen GJB, van Roon-Mom WMC. Targeting several CAG expansion diseases by a single antisense oligonucleotide. PLoS One 2011; 6:e24308. [PMID: 21909428 PMCID: PMC3164722 DOI: 10.1371/journal.pone.0024308] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2011] [Accepted: 08/04/2011] [Indexed: 12/16/2022] Open
Abstract
To date there are 9 known diseases caused by an expanded polyglutamine repeat, with the most prevalent being Huntington's disease. Huntington's disease is a progressive autosomal dominant neurodegenerative disorder for which currently no therapy is available. It is caused by a CAG repeat expansion in the HTT gene, which results in an expansion of a glutamine stretch at the N-terminal end of the huntingtin protein. This polyglutamine expansion plays a central role in the disease and results in the accumulation of cytoplasmic and nuclear aggregates. Here, we make use of modified 2'-O-methyl phosphorothioate (CUG)n triplet-repeat antisense oligonucleotides to effectively reduce mutant huntingtin transcript and protein levels in patient-derived Huntington's disease fibroblasts and lymphoblasts. The most effective antisense oligonucleotide, (CUG)(7), also reduced mutant ataxin-1 and ataxin-3 mRNA levels in spinocerebellar ataxia 1 and 3, respectively, and atrophin-1 in dentatorubral-pallidoluysian atrophy patient derived fibroblasts. This antisense oligonucleotide is not only a promising therapeutic tool to reduce mutant huntingtin levels in Huntington's disease but our results in spinocerebellar ataxia and dentatorubral-pallidoluysian atrophy cells suggest that this could also be applicable to other polyglutamine expansion disorders as well.
Collapse
Affiliation(s)
- Melvin M. Evers
- Center for Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Barry A. Pepers
- Center for Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | | | | | - Johan T. den Dunnen
- Center for Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
| | - Annemieke Aartsma-Rus
- Center for Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Gert-Jan B. van Ommen
- Center for Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | | |
Collapse
|
18
|
Haerty W, Golding GB. Increased polymorphism near low-complexity sequences across the genomes of Plasmodium falciparum isolates. Genome Biol Evol 2011; 3:539-50. [PMID: 21602572 PMCID: PMC3140889 DOI: 10.1093/gbe/evr045] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Low-complexity regions (LCRs) within proteins sequences are often considered to evolve neutrally even though recent studies reported evidence for selection acting on some of them. Because of their widespread distribution among eukaryotes genomes and the potential deleterious effect of expansion/contraction of some of them in humans, low-complexity sequences are of major interest and numerous studies have attempted to describe their dynamic between genomes as well as the factors correlated to their variation and to assess their selective value. However, due to the scarcity of individual genomes within a species, most of the analyses so far have been performed at the species level with the implicit assumption that the variation both in composition and size within species is too small relative to the between-species divergence to affect the conclusions of the analysis. Here we used the available genomes of 14 Plasmodium falciparum isolates to assess the relationship between low-complexity sequence variation and factors such as nucleotide polymorphism across strains, sequence composition, and protein expression. We report that more than half of the 7,711 low-complexity sequences found within aligned coding sequences are variable in size among strains. Across strains, we observed an increasing density of polymorphic sites toward the LCR boundaries. This observation strongly suggests the joint effects of lowered selective constraints on low-complexity sequences and a mutagenic effect of these simple sequences.
Collapse
Affiliation(s)
- Wilfried Haerty
- Department of Biology, McMaster University, Hamilton, Ontario, Canada
| | | |
Collapse
|
19
|
Haerty W, Golding GB. Low-complexity sequences and single amino acid repeats: not just "junk" peptide sequences. Genome 2011; 53:753-62. [PMID: 20962881 DOI: 10.1139/g10-063] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
For decades proteins were thought to interact in a "lock and key" system, which led to the definition of a paradigm linking stable three-dimensional structure to biological function. As a consequence, any non-structured peptide was considered to be nonfunctional and to evolve neutrally. Surprisingly, the most commonly shared peptides between eukaryotic proteomes are low-complexity sequences that in most conditions do not present a stable three-dimensional structure. However, because these sequences evolve rapidly and because the size variation of a few of them can have deleterious effects, low-complexity sequences have been suggested to be the target of selection. Here we review evidence that supports the idea that these simple sequences should not be considered just "junk" peptides and that selection drives the evolution of many of them.
Collapse
Affiliation(s)
- Wilfried Haerty
- Biology Department, McMaster University, Hamilton, ON, Canada
| | | |
Collapse
|
20
|
Lin Y, Wilson JH. Transcription-induced DNA toxicity at trinucleotide repeats: double bubble is trouble. Cell Cycle 2011; 10:611-8. [PMID: 21293182 DOI: 10.4161/cc.10.4.14729] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
Trinucleotide repeats (TNR) are a blessing and a curse. In coding regions, where they are enriched, short repeats offer the potential for continuous, rapid length variation with linked incremental changes in the activity of the encoded protein, a valuable source of variation for evolution. But at the upper end of these benign and beneficial lengths, trinucleotide repeats become very unstable, with a dangerous bias toward continual expansion, which can lead to neurological diseases in humans. The mechanisms of expansion are varied and the links to disease are complex. Where they have been delineated, however, they have often revealed unexpected, fundamental aspects of the underlying cell biology. Nowhere is this more apparent than in recent studies, which indicate that expanded CAG repeats can form toxic sites in the genome, which can, upon interaction with normal components of DNA metabolism, trigger cell death. Here we discuss the phenomenon of TNR-induced DNA toxicity, with special emphasis on the role of transcription. Transcription-induced DNA toxicity may have profound biological consequences, with particular relevance to repeat-associated neurodegenerative diseases.
Collapse
Affiliation(s)
- Yunfu Lin
- Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, TX USA.
| | | |
Collapse
|
21
|
Cer RZ, Bruce KH, Mudunuri US, Yi M, Volfovsky N, Luke BT, Bacolla A, Collins JR, Stephens RM. Non-B DB: a database of predicted non-B DNA-forming motifs in mammalian genomes. Nucleic Acids Res 2010; 39:D383-91. [PMID: 21097885 PMCID: PMC3013731 DOI: 10.1093/nar/gkq1170] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Although the capability of DNA to form a variety of non-canonical (non-B) structures has long been recognized, the overall significance of these alternate conformations in biology has only recently become accepted en masse. In order to provide access to genome-wide locations of these classes of predicted structures, we have developed non-B DB, a database integrating annotations and analysis of non-B DNA-forming sequence motifs. The database provides the most complete list of alternative DNA structure predictions available, including Z-DNA motifs, quadruplex-forming motifs, inverted repeats, mirror repeats and direct repeats and their associated subsets of cruciforms, triplex and slipped structures, respectively. The database also contains motifs predicted to form static DNA bends, short tandem repeats and homo(purine•pyrimidine) tracts that have been associated with disease. The database has been built using the latest releases of the human, chimp, dog, macaque and mouse genomes, so that the results can be compared directly with other data sources. In order to make the data interpretable in a genomic context, features such as genes, single-nucleotide polymorphisms and repetitive elements (SINE, LINE, etc.) have also been incorporated. The database is accessed through query pages that produce results with links to the UCSC browser and a GBrowse-based genomic viewer. It is freely accessible at http://nonb.abcc.ncifcrf.gov.
Collapse
Affiliation(s)
- Regina Z Cer
- Advanced Biomedical Computing Center, Information Systems Program, SAIC-Frederick, Inc, NCI-Frederick, Frederick, MD 21702, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
22
|
Payseur BA, Jing P, Haasl RJ. A genomic portrait of human microsatellite variation. Mol Biol Evol 2010; 28:303-12. [PMID: 20675409 DOI: 10.1093/molbev/msq198] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
Rapid advances in DNA sequencing and genotyping technologies are beginning to reveal the scope and pattern of human genomic variation. Although single nucleotide polymorphisms (SNPs) have been intensively studied, the extent and form of variation at other types of molecular variants remain poorly understood. Polymorphism at the most variable loci in the human genome, microsatellites, has rarely been examined on a genomic scale without the ascertainment biases that attend typical genotyping studies. We conducted a genomic survey of variation at microsatellites with at least three perfect repeats by comparing two complete genome sequences, the Human Genome Reference sequence and the sequence of J. Craig Venter. The genomic proportion of polymorphic loci was 2.7%, much higher than the rate of SNP variation, with marked heterogeneity among classes of loci. The proportion of variable loci increased substantially with repeat number. Repeat lengths differed in levels of variation, with longer repeat lengths generally showing higher polymorphism at the same repeat number. Microsatellite variation was weakly correlated with regional SNP number, indicating modest effects of shared genealogical history. Reductions in variation were detected at microsatellites located in introns, in untranslated regions, in coding exons, and just upstream of transcription start sites, suggesting the presence of selective constraints. Our results provide new insights into microsatellite mutational processes and yield a preview of patterns of variation that will be obtained in genomic surveys of larger numbers of individuals.
Collapse
|
23
|
Kelkar YD, Strubczewski N, Hile SE, Chiaromonte F, Eckert KA, Makova KD. What is a microsatellite: a computational and experimental definition based upon repeat mutational behavior at A/T and GT/AC repeats. Genome Biol Evol 2010; 2:620-35. [PMID: 20668018 PMCID: PMC2940325 DOI: 10.1093/gbe/evq046] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Microsatellites are abundant in eukaryotic genomes and have high rates of strand slippage-induced repeat number alterations. They are popular genetic markers, and their mutations are associated with numerous neurological diseases. However, the minimal number of repeats required to constitute a microsatellite has been debated, and a definition of a microsatellite that considers its mutational behavior has been lacking. To define a microsatellite, we investigated slippage dynamics for a range of repeat sizes, utilizing two approaches. Computationally, we assessed length polymorphism at repeat loci in ten ENCODE regions resequenced in four human populations, assuming that the occurrence of polymorphism reflects strand slippage rates. Experimentally, we determined the in vitro DNA polymerase-mediated strand slippage error rates as a function of repeat number. In both approaches, we compared strand slippage rates at tandem repeats with the background slippage rates. We observed two distinct modes of mutational behavior. At small repeat numbers, slippage rates were low and indistinguishable from background measurements. A marked transition in mutability was observed as the repeat array lengthened, such that slippage rates at large repeat numbers were significantly higher than the background rates. For both mononucleotide and dinucleotide microsatellites studied, the transition length corresponded to a similar number of nucleotides (approximately 10). Thus, microsatellite threshold is determined not by the presence/absence of strand slippage at repeats but by an abrupt alteration in slippage rates relative to background. These findings have implications for understanding microsatellite mutagenesis, standardization of genome-wide microsatellite analyses, and predicting polymorphism levels of individual microsatellite loci.
Collapse
|
24
|
Shikano T, Ramadevi J, Shimada Y, Merilä J. Utility of sequenced genomes for microsatellite marker development in non-model organisms: a case study of functionally important genes in nine-spined sticklebacks (Pungitius pungitius). BMC Genomics 2010; 11:334. [PMID: 20507571 PMCID: PMC2891615 DOI: 10.1186/1471-2164-11-334] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2010] [Accepted: 05/27/2010] [Indexed: 12/04/2022] Open
Abstract
Background Identification of genes involved in adaptation and speciation by targeting specific genes of interest has become a plausible strategy also for non-model organisms. We investigated the potential utility of available sequenced fish genomes to develop microsatellite (cf. simple sequence repeat, SSR) markers for functionally important genes in nine-spined sticklebacks (Pungitius pungitius), as well as cross-species transferability of SSR primers from three-spined (Gasterosteus aculeatus) to nine-spined sticklebacks. In addition, we examined the patterns and degree of SSR conservation between these species using their aligned sequences. Results Cross-species amplification success was lower for SSR markers located in or around functionally important genes (27 out of 158) than for those randomly derived from genomic (35 out of 101) and cDNA (35 out of 87) libraries. Polymorphism was observed at a large proportion (65%) of the cross-amplified loci independently of SSR type. To develop SSR markers for functionally important genes in nine-spined sticklebacks, SSR locations were surveyed in or around 67 target genes based on the three-spined stickleback genome and these regions were sequenced with primers designed from conserved sequences in sequenced fish genomes. Out of the 81 SSRs identified in the sequenced regions (44,084 bp), 57 exhibited the same motifs at the same locations as in the three-spined stickleback. Di- and trinucleotide SSRs appeared to be highly conserved whereas mononucleotide SSRs were less so. Species-specific primers were designed to amplify 58 SSRs using the sequences of nine-spined sticklebacks. Conclusions Our results demonstrated that a large proportion of SSRs are conserved in the species that have diverged more than 10 million years ago. Therefore, the three-spined stickleback genome can be used to predict SSR locations in the nine-spined stickleback genome. While cross-species utility of SSR primers is limited due to low amplification success, SSR markers can be developed for target genes and genomic regions using our approach, which should be also applicable to other non-model organisms. The SSR markers developed in this study should be useful for identification of genes responsible for phenotypic variation and adaptive divergence of nine-spined stickleback populations, as well as for constructing comparative gene maps of nine-spined and three-spined sticklebacks.
Collapse
Affiliation(s)
- Takahito Shikano
- Ecological Genetics Research Unit, Department of Biosciences, University of Helsinki, P,O, Box 65, FI-00014, Helsinki, Finland.
| | | | | | | |
Collapse
|
25
|
Kozlowski P, de Mezer M, Krzyzosiak WJ. Trinucleotide repeats in human genome and exome. Nucleic Acids Res 2010; 38:4027-39. [PMID: 20215431 PMCID: PMC2896521 DOI: 10.1093/nar/gkq127] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Trinucleotide repeats (TNRs) are of interest in genetics because they are used as markers for tracing genotype–phenotype relations and because they are directly involved in numerous human genetic diseases. In this study, we searched the human genome reference sequence and annotated exons (exome) for the presence of uninterrupted triplet repeat tracts composed of six or more repeated units. A list of 32 448 TNRs and 878 TNR-containing genes was generated and is provided herein. We found that some triplet repeats, specifically CNG, are overrepresented, while CTT, ATC, AAC and AAT are underrepresented in exons. This observation suggests that the occurrence of TNRs in exons is not random, but undergoes positive or negative selective pressure. Additionally, TNR types strongly determine their localization in mRNA sections (ORF, UTRs). Most genes containing exon-overrepresented TNRs are associated with gene ontology-defined functions. Surprisingly, many groups of genes that contain TNR types coding for different homo-amino acid tracts associate with the same transcription-related GO categories. We propose that TNRs have potential to be functional genetic elements and that their variation may be involved in the regulation of many common phenotypes; as such, TNR polymorphisms should be considered a priority in association studies.
Collapse
Affiliation(s)
- Piotr Kozlowski
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
| | | | | |
Collapse
|
26
|
Hannan AJ. Tandem repeat polymorphisms: modulators of disease susceptibility and candidates for ‘missing heritability’. Trends Genet 2010; 26:59-65. [PMID: 20036436 DOI: 10.1016/j.tig.2009.11.008] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2009] [Revised: 11/27/2009] [Accepted: 11/30/2009] [Indexed: 01/26/2023]
|