1
Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CWK, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023; 110:2077-2091. [PMID: 38065072 PMCID: PMC10716520 DOI: 10.1016/j.ajhg.2023.10.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/08/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 12/18/2023] Open
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
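As a rough illustration of the quantity whose conditional expectation the abstract refers to, the sketch below computes an ordinary genotype-based local GRM for one window of sites, using the standard variance-components scaling. It is not the eGRM itself, which is computed from ARG branches rather than observed genotypes; the function name `local_grm` and the toy window are hypothetical.

```python
def local_grm(genotypes):
    """Genotype-based GRM for one genomic window (illustrative only).

    genotypes: one list per diploid individual, holding allele counts
    (0, 1 or 2) at each site in the window.
    """
    n_ind = len(genotypes)
    n_sites = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    used = 0
    for s in range(n_sites):
        column = [row[s] for row in genotypes]
        p = sum(column) / (2 * n_ind)      # sample allele frequency
        if p == 0.0 or p == 1.0:
            continue                       # monomorphic sites carry no signal
        used += 1
        scale = 2 * p * (1 - p)            # expected genotype variance
        centered = [g - 2 * p for g in column]
        for j in range(n_ind):
            for k in range(n_ind):
                grm[j][k] += centered[j] * centered[k] / scale
    if used == 0:
        raise ValueError("no polymorphic sites in window")
    # Average the per-site contributions over the sites actually used.
    return [[value / used for value in row] for row in grm]


# Toy window (assumed data): three individuals, two sites.
window = [[0, 2], [2, 0], [1, 1]]
K = local_grm(window)
```

In a variance-components test, a matrix like `K` would serve as the covariance structure of a local random effect; the tree-based version replaces the sum over observed sites with an expectation over mutations placed on tree branches.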
Affiliation(s)
- Vivian Link
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Joshua G Schraiber
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Caoqi Fan
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Bryan Dinh
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Michael D Edge
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
2
Karunarathna CB, Graham J. perfectphyloR: An R package for reconstructing perfect phylogenies. BMC Bioinformatics 2019; 20:729. [PMID: 31870286 PMCID: PMC6929499 DOI: 10.1186/s12859-019-3313-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/21/2019] [Accepted: 12/11/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A perfect phylogeny is a rooted binary tree that recursively partitions sequences. The nested partitions of a perfect phylogeny provide insight into the pattern of ancestry of genetic sequence data. For example, sequences may cluster together in a partition indicating that they arise from a common ancestral haplotype. RESULTS We present an R package perfectphyloR to reconstruct the local perfect phylogenies underlying a sample of binary sequences. The package enables users to associate the reconstructed partitions with a user-defined partition. We describe and demonstrate the major functionality of the package. CONCLUSION The perfectphyloR package should be of use to researchers seeking insight into the ancestral structure of their sequence data. The reconstructed partitions have many applications, including the mapping of trait-influencing variants.
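A standard precondition for reconstructing a perfect phylogeny from binary sequences is that every pair of sites passes the four-gamete test. The sketch below checks that condition in Python; it is a generic illustration and not the perfectphyloR API, and the function names and toy haplotypes are hypothetical.

```python
from itertools import combinations


def sites_compatible(haplotypes, i, j):
    """Four-gamete test: sites i and j are compatible unless all of
    00, 01, 10 and 11 occur among the haplotypes."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def admits_perfect_phylogeny(haplotypes):
    """True if every pair of sites passes the four-gamete test."""
    n_sites = len(haplotypes[0])
    return all(
        sites_compatible(haplotypes, i, j)
        for i, j in combinations(range(n_sites), 2)
    )


# Toy data (assumed): rows are haplotypes, columns are binary sites.
compatible_sample = [[0, 0], [0, 1], [1, 0]]
incompatible_sample = compatible_sample + [[1, 1]]

print(admits_perfect_phylogeny(compatible_sample))
print(admits_perfect_phylogeny(incompatible_sample))
```

When a pair of sites fails the test, no single tree explains both sites without recurrent mutation or recombination, which is why such methods work with local phylogenies over windows of compatible sites.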
Affiliation(s)
- Charith B Karunarathna
- Department of Statistics and Actuarial Science, 8888 University Drive, Burnaby, V5A 1S6, Canada
- Jinko Graham
- Department of Statistics and Actuarial Science, 8888 University Drive, Burnaby, V5A 1S6, Canada
3
Karunarathna CB, Graham J. Using Gene Genealogies to Localize Rare Variants Associated with Complex Traits in Diploid Populations. Hum Hered 2018; 83:30-39. [PMID: 29763929 DOI: 10.1159/000486854] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 08/18/2017] [Accepted: 01/16/2018] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND AND AIMS Many methods can detect trait association with causal variants in candidate genomic regions; however, a comparison of their ability to localize causal variants is lacking. We extend a previous study of the detection abilities of these methods to a comparison of their localization abilities. METHODS Through coalescent simulation, we compare several popular association methods. Cases and controls are sampled from a diploid population to mimic human studies. As benchmarks for comparison, we include two methods that cluster phenotypes on the true genealogical trees: a naive Mantel test considered previously in haploid populations and an extension that takes into account whether case haplotypes carry a causal variant. We first work through a simulated dataset to illustrate the methods. We then perform a simulation study to score the localization and detection properties. RESULTS In our simulations, the association signal was localized least precisely by the naive Mantel test and most precisely by its extension. Most other approaches had intermediate performance similar to the single-variant Fisher exact test. CONCLUSIONS Our results confirm earlier findings in haploid populations about potential gains in performance from genealogy-based approaches. They also highlight differences between haploid and diploid populations when localizing and detecting causal variants.
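To make the idea of a Mantel test on genealogical trees concrete, the sketch below runs a generic permutation-based Mantel test between a tree-distance matrix and a phenotype-distance matrix. It uses the Pearson correlation of off-diagonal entries as the statistic, which may differ from the exact statistic used in the study; the function names and both toy matrices are assumptions.

```python
import random
from itertools import combinations


def upper_triangle(matrix, order):
    """Off-diagonal entries of a symmetric matrix after relabelling rows
    and columns by `order`."""
    return [matrix[order[i]][order[j]]
            for i, j in combinations(range(len(order)), 2)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """One-sided Mantel test: permute the labels of dist_b and count how
    often the correlation is at least as large as the observed one."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    a = upper_triangle(dist_a, identity)
    observed = pearson(a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(a, upper_triangle(dist_b, perm)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (assumed): four haplotypes in two clades; phenotype distance is
# 0 within a clade and 1 between clades.
tree_dist = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
pheno_dist = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
observed_r, p_value = mantel_test(tree_dist, pheno_dist, n_perm=199)
```

Permuting rows and columns together preserves the internal structure of each matrix, so the null distribution reflects only the loss of correspondence between genealogy and phenotype.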
4
Folk RA, Soltis PS, Soltis DE, Guralnick R. New prospects in the detection and comparative analysis of hybridization in the tree of life. Am J Bot 2018; 105:364-375. [PMID: 29683488 DOI: 10.1002/ajb2.1018] [Citation(s) in RCA: 77] [Impact Index Per Article: 12.8] [Received: 06/01/2017] [Accepted: 10/12/2017] [Indexed: 05/03/2023]
Abstract
Assessing the relative importance of the various pathways to diversification is a central goal of biodiversity researchers. For plant biologists, and increasingly across the spectrum of biological sciences, among these pathways of interest is hybridization. New methodological developments are moving the field away from questions of whether natural hybridization occurs or hybrids can persist and toward more direct assessments of the long-term impact of hybridization on diversification and genome organization. Advances in theory and new data, especially phylogenomic data, have changed the face of this field, revealing extensive occurrences of hybridization at both shallow and deep levels, but lacking is a synthesis of these advancements. Here we provide an overview of methods that have been proposed for detecting hybridization with molecular data and advocate a time-extended, comparative view of reticulate evolution. In particular, we pose three overarching questions, newly placed within reach, that are critical for advancing our understanding of hybridization pattern and process: (1) How often is introgression biased toward certain genomes and loci, and is this bias selectively neutral? (2) What are the relative rates of formation of hybrid species and introgressants, and how does this compare to their subsequent fates? (3) Has the frequency of hybridization increased under historical periods of greater dynamism in climate and geographic range, such as the Pleistocene?
Affiliation(s)
- Ryan A Folk
- Florida Museum of Natural History, 1659 Museum Road, Gainesville, Florida, 32611, USA
- Pamela S Soltis
- Florida Museum of Natural History, 1659 Museum Road, Gainesville, Florida, 32611, USA
- Douglas E Soltis
- Department of Biology, University of Florida, 876 Newell Drive, Gainesville, Florida, 32611, USA
- Genetics Institute, University of Florida, 2033 Mowry Road, Gainesville, Florida, 32611, USA
- Robert Guralnick
- Florida Museum of Natural History, 1659 Museum Road, Gainesville, Florida, 32611, USA
5
Thompson KL, Linnen CR, Kubatko L. Tree-based quantitative trait mapping in the presence of external covariates. Stat Appl Genet Mol Biol 2016; 15:473-490. [PMID: 27875322 DOI: 10.1515/sagmb-2015-0107] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
A central goal in biological and biomedical sciences is to identify the molecular basis of variation in morphological and behavioral traits. Over the last decade, improvements in sequencing technologies coupled with the active development of association mapping methods have made it possible to link single nucleotide polymorphisms (SNPs) and quantitative traits. However, a major limitation of existing methods is that they are often unable to consider complex, but biologically-realistic, scenarios. Previous work showed that association mapping method performance can be improved by using the evolutionary history within each SNP to estimate the covariance structure among randomly-sampled individuals. Here, we propose a method that can be used to analyze a variety of data types, such as data including external covariates, while considering the evolutionary history among SNPs, providing an advantage over existing methods. Existing methods either do so at a computational cost, or fail to model these relationships altogether. By considering the broad-scale relationships among SNPs, the proposed approach is both computationally-feasible and informed by the evolutionary history among SNPs. We show that incorporating an approximate covariance structure during analysis of complex data sets increases performance in quantitative trait mapping, and apply the proposed method to deer mice data.
6
Thompson KL, Fardo DW. Comparing performance of non-tree-based and tree-based association mapping methods. BMC Proc 2016; 10:405-410. [PMID: 27980669 PMCID: PMC5133494 DOI: 10.1186/s12919-016-0063-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
A central goal in the biomedical and biological sciences is to link variation in quantitative traits to locations along the genome (single nucleotide polymorphisms). Sequencing technology has rapidly advanced in recent decades, along with the statistical methodology to analyze genetic data. Two classes of association mapping methods exist: those that account for the evolutionary relatedness among individuals, and those that ignore the evolutionary relationships among individuals. While the former methods more fully use implicit information in the data, the latter methods are more flexible in the types of data they can handle. This study presents a comparison of the 2 types of association mapping methods when they are applied to simulated data.
Affiliation(s)
- David W. Fardo
- Department of Biostatistics, University of Kentucky College of Public Health, Lexington, KY 40536-0003 USA
7
Burkett KM, Greenwood CMT, McNeney B, Graham J. Gene genealogies for genetic association mapping, with application to Crohn's disease. Front Genet 2013; 4:260. [PMID: 24348515 PMCID: PMC3845011 DOI: 10.3389/fgene.2013.00260] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/23/2013] [Accepted: 11/12/2013] [Indexed: 11/30/2022] Open
Abstract
A gene genealogy describes relationships among haplotypes sampled from a population. Knowledge of the gene genealogy for a set of haplotypes is useful for estimation of population genetic parameters and it also has potential application in finding disease-predisposing genetic variants. As the true gene genealogy is unknown, Markov chain Monte Carlo (MCMC) approaches have been used to sample genealogies conditional on data at multiple genetic markers. We previously implemented an MCMC algorithm to sample from an approximation to the distribution of the gene genealogy conditional on haplotype data. Our approach samples ancestral trees, recombination and mutation rates at a genomic focal point. In this work, we describe how our sampler can be used to find disease-predisposing genetic variants in samples of cases and controls. We use a tree-based association statistic that quantifies the degree to which case haplotypes are more closely related to each other around the focal point than control haplotypes, without relying on a disease model. As the ancestral tree is a latent variable, so is the tree-based association statistic. We show how the sampler can be used to estimate the posterior distribution of the latent test statistic and corresponding latent p-values, which together comprise a fuzzy p-value. We illustrate the approach on a publicly available dataset from a study of Crohn's disease that consists of genotypes at multiple SNP markers in a small genomic region. We estimate the posterior distribution of the tree-based association statistic and the recombination rate at multiple focal points in the region. Reassuringly, the posterior mean recombination rates estimated at the different focal points are consistent with previously published estimates. The tree-based association approach finds multiple sub-regions where the case haplotypes are more genetically related than the control haplotypes, and suggests that there may be one or multiple disease-predisposing loci.
Affiliation(s)
- Kelly M Burkett
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada; Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
- Celia M T Greenwood
- Department of Oncology, Department of Epidemiology, Biostatistics and Occupational Health, and Division of Cancer Epidemiology, McGill University, Montreal, QC, Canada; Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Brad McNeney
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
- Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
8
Thompson KL, Kubatko LS. Using ancestral information to detect and localize quantitative trait loci in genome-wide association studies. BMC Bioinformatics 2013; 14:200. [PMID: 23786262 PMCID: PMC3706278 DOI: 10.1186/1471-2105-14-200] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 03/12/2013] [Accepted: 06/06/2013] [Indexed: 11/15/2022] Open
Abstract
Background In mammalian genetics, many quantitative traits, such as blood pressure, are thought to be influenced by specific genes, but are also affected by environmental factors, making the associated genes difficult to identify and locate from genetic data alone. In particular, the application of classical statistical methods to single nucleotide polymorphism (SNP) data collected in genome-wide association studies has been especially challenging. We propose a coalescent approach to search for SNPs associated with quantitative traits in genome-wide association study (GWAS) data by taking into account the evolutionary history among SNPs. Results We evaluate the performance of the new method using simulated data, and find that it performs at least as well as existing methods with an increase in performance in the case of population structure. Application of the methodology to a real data set consisting of high-density lipoprotein cholesterol measurements in mice shows the method performs well for empirical data, as well. Conclusions By combining methods from stochastic processes and phylogenetics, this work provides an innovative avenue for the development of new statistical methodology in the analysis of GWAS data.
9
Edriss V, Fernando RL, Su G, Lund MS, Guldbrandtsen B. The effect of using genealogy-based haplotypes for genomic prediction. Genet Sel Evol 2013; 45:5. [PMID: 23496971 PMCID: PMC3655921 DOI: 10.1186/1297-9686-45-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 10/01/2012] [Accepted: 02/13/2013] [Indexed: 11/10/2022] Open
Abstract
Background Genomic prediction uses two sources of information: linkage disequilibrium between markers and quantitative trait loci, and additive genetic relationships between individuals. One way to increase the accuracy of genomic prediction is to capture more linkage disequilibrium by regression on haplotypes instead of regression on individual markers. The aim of this study was to investigate the accuracy of genomic prediction using haplotypes based on local genealogy information. Methods A total of 4429 Danish Holstein bulls were genotyped with the 50K SNP chip. Haplotypes were constructed using local genealogical trees. Effects of haplotype covariates were estimated with two types of prediction models: (1) assuming that effects had the same distribution for all haplotype covariates, i.e. the GBLUP method and (2) assuming that a large proportion (π) of the haplotype covariates had zero effect, i.e. a Bayesian mixture method. Results About 7.5 times more covariate effects were estimated when fitting haplotypes based on local genealogical trees compared to fitting individual markers. Genealogy-based haplotype clustering slightly increased the accuracy of genomic prediction and, in some cases, decreased the bias of prediction. With the Bayesian method, accuracy of prediction was less sensitive to parameter π when fitting haplotypes compared to fitting markers. Conclusions Use of haplotypes based on genealogy can slightly increase the accuracy of genomic prediction. Improved methods to cluster the haplotypes constructed from local genealogy could lead to additional gains in accuracy.
Affiliation(s)
- Vahid Edriss
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, Tjele DK-8830, Denmark.
10
Dashab GR, Kadri NK, Shariati MM, Sahana G. Comparison of linear mixed model analysis and genealogy-based haplotype clustering with a Bayesian approach for association mapping in a pedigreed population. BMC Proc 2012; 6 Suppl 2:S4. [PMID: 22640641 PMCID: PMC3363158 DOI: 10.1186/1753-6561-6-s2-s4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/02/2022] Open
Abstract
Background Despite many success stories of genome-wide association studies (GWAS), challenges exist in QTL detection, especially in datasets with many levels of relatedness. In this study we compared four methods of GWA on a dataset simulated for the 15th QTL-MAS workshop. The four methods were 1) Mixed model analysis (MMA), 2) Random haplotype model (RHM), 3) Genealogy-based mixed model (GENMIX), and 4) Bayesian variable selection (BVS). The data consisted of phenotypes of 2000 animals from 20 sire families, genotyped with 9990 SNPs on five chromosomes. Results Out of the eight simulated QTL, the four methods MMA, RHM, GENMIX and BVS identified 6, 6, 8 and 7 QTL respectively, and 4 QTL were common across the methods. GENMIX had the highest power to detect QTL; however, it also produced 4 false positives. BVS was the second best method in terms of power, detecting all QTL except the one on chromosome 5 with epistatic interaction. Two spurious associations were obtained across methods. Though all the methods considered the full pedigree in the analyses, it was not sufficient to avoid all the spurious associations arising due to family structure. Conclusions Using several methods with divergent approaches for GWAS can be useful in gaining confidence in the QTL identified. In our comparison, GENMIX was found to be the best method in terms of power, but it needs appropriate correction for multiple testing to avoid the false positives. This study shows that the issues of multiple testing and the relatedness among study samples need special attention in GWAS.
Affiliation(s)
- Golam R Dashab
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, DK-8830 Tjele, Denmark; Department of Animal Science, Ferdowsi University of Mashhad, 91775 Mashhad, Iran
- Naveen K Kadri
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, DK-8830 Tjele, Denmark
- Mohammad M Shariati
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, DK-8830 Tjele, Denmark; Department of Animal Science, Ferdowsi University of Mashhad, 91775 Mashhad, Iran
- Goutam Sahana
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, DK-8830 Tjele, Denmark
11
HTreeQA: Using Semi-Perfect Phylogeny Trees in Quantitative Trait Loci Study on Genotype Data. G3 (Bethesda) 2012; 2:175-89. [PMID: 22384396 PMCID: PMC3284325 DOI: 10.1534/g3.111.001768] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 10/01/2011] [Accepted: 10/22/2011] [Indexed: 01/30/2023]
Abstract
With the advances in high-throughput genotyping technology, the study of quantitative trait loci (QTL) has emerged as a promising tool to understand the genetic basis of complex traits. Methodology development for the study of QTL has recently attracted significant research attention. Local phylogeny-based methods have been demonstrated to be powerful tools for uncovering significant associations between phenotypes and single-nucleotide polymorphism markers. However, most existing methods are designed for homozygous genotypes, and a separate haplotype reconstruction step is often needed to resolve heterozygous genotypes. This approach has limited power to detect nonadditive genetic effects and imposes an extensive computational burden. In this article, we propose a new method, HTreeQA, that uses a tristate semi-perfect phylogeny tree to approximate the perfect phylogeny used in existing methods. The semi-perfect phylogeny trees are used as high-level markers for association study. HTreeQA uses the genotype data as direct input without phasing. HTreeQA can handle complex local population structures. It is suitable for QTL mapping in any mouse population, including the incipient Collaborative Cross lines. Applying HTreeQA, we find significant QTLs for two phenotypes of the PreCC lines: white head spot and running distance at day 5/6. These findings are consistent with known genes and QTL discovered in independent studies. Simulation studies under three different genetic models show that HTreeQA can detect a wider range of genetic effects and is more efficient than existing phylogeny-based approaches. We also provide rigorous theoretical analysis to show that HTreeQA has a lower error rate than alternative methods.
12
Sahana G, Mailund T, Lund MS, Guldbrandtsen B. Local genealogies in a linear mixed model for genome-wide association mapping in complex pedigreed populations. PLoS One 2011; 6:e27061. [PMID: 22073255 PMCID: PMC3206889 DOI: 10.1371/journal.pone.0027061] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/28/2011] [Accepted: 10/10/2011] [Indexed: 11/18/2022] Open
Abstract
INTRODUCTION The state of the art for dealing with multiple levels of relationship among the samples in genome-wide association studies (GWAS) is unified mixed model analysis (MMA). This approach is very flexible, can be applied to both family-based and population-based samples, and can be extended to incorporate other effects in a straightforward and rigorous fashion. Here, we present a complementary approach, called 'GENMIX (genealogy based mixed model)', which combines advantages from two powerful GWAS methods: genealogy-based haplotype grouping and MMA. SUBJECTS AND METHODS We validated GENMIX using genotyping data of Danish Jersey cattle and simulated phenotypes, and compared it to MMA. We simulated scenarios for three levels of heritability (0.21, 0.34, and 0.64), seven levels of MAF (0.05, 0.10, 0.15, 0.20, 0.25, 0.35, and 0.45) and five levels of QTL effect (0.1, 0.2, 0.5, 0.7 and 1.0 in phenotypic standard deviation units). Each of these 105 possible combinations (3 h² × 7 MAF × 5 effects) of scenarios was replicated 25 times. RESULTS GENMIX provides a better ranking of markers close to the causative locus' location. GENMIX outperformed MMA when the QTL effect was small and the MAF at the QTL was low. In scenarios where MAF was high or the QTL affecting the trait had a large effect, both GENMIX and MMA performed similarly. CONCLUSION In discovery studies, where high-ranking markers are identified and later examined in validation studies, we therefore expect GENMIX to enrich candidates brought to follow-up studies with true positives over false positives more than the MMA would.
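The factorial design described in the abstract can be enumerated directly. The sketch below lists the scenario grid and its replicates using the levels quoted above; the variable names are illustrative, and the simulation step itself is not shown.

```python
from itertools import product

# Levels as quoted in the abstract.
heritabilities = [0.21, 0.34, 0.64]
mafs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45]
qtl_effects = [0.1, 0.2, 0.5, 0.7, 1.0]  # in phenotypic SD units
replicates = 25

# Every combination of heritability, MAF and QTL effect is one scenario.
scenarios = list(product(heritabilities, mafs, qtl_effects))

# Each scenario is repeated `replicates` times.
runs = [
    (h2, maf, effect, rep)
    for (h2, maf, effect) in scenarios
    for rep in range(replicates)
]

print(len(scenarios), len(runs))  # 105 scenarios, 2625 runs
```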
Affiliation(s)
- Goutam Sahana
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, Tjele, Denmark.
13
He Y, Li C, Amos CI, Xiong M, Ling H, Jin L. Accelerating haplotype-based genome-wide association study using perfect phylogeny and phase-known reference data. PLoS One 2011; 6:e22097. [PMID: 21789217 PMCID: PMC3137625 DOI: 10.1371/journal.pone.0022097] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 08/17/2010] [Accepted: 06/17/2011] [Indexed: 11/18/2022] Open
Abstract
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by the prohibitive time cost of haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in the phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.
Affiliation(s)
- Yungang He
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- Cong Li
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- Christopher I. Amos
- Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Momiao Xiong
- Human Genetics Center, University of Texas School of Public Health, Houston, Texas, United States of America
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Hua Ling
- Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland, United States of America
- Li Jin
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
14
Günther T, Gawenda I, Schmid KJ. phenosim--A software to simulate phenotypes for testing in genome-wide association studies. BMC Bioinformatics 2011; 12:265. [PMID: 21714868 PMCID: PMC3150295 DOI: 10.1186/1471-2105-12-265] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 03/15/2011] [Accepted: 06/29/2011] [Indexed: 01/15/2023] Open
Abstract
Background There is a great interest in understanding the genetic architecture of complex traits in natural populations. Genome-wide association studies (GWAS) are becoming routine in human, animal and plant genetics to understand the connection between naturally occurring genotypic and phenotypic variation. Coalescent simulations are commonly used in population genetics to simulate genotypes under different parameters and demographic models. Results Here, we present phenosim, a software to add a phenotype to genotypes generated in time-efficient coalescent simulations. Both qualitative and quantitative phenotypes can be generated and it is possible to partition phenotypic variation between additive effects and epistatic interactions between causal variants. The output formats of phenosim are directly usable as input for different GWAS tools. The applicability of phenosim is shown by simulating a genome-wide association study in Arabidopsis thaliana. Conclusions By using the coalescent approach to generate genotypes and phenosim to add phenotypes, the data sets can be used to assess the influence of various factors such as demography, genetic architecture or selection on the statistical power of association methods to detect causal genetic variants under a wide variety of population genetic scenarios. phenosim is freely available from the authors' website http://evoplant.uni-hohenheim.de
Affiliation(s)
- Torsten Günther
- Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Stuttgart, Germany.
|
15
|
Wu Y. New methods for inference of local tree topologies with recombinant SNP sequences in populations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:182-193. [PMID: 21071806 DOI: 10.1109/tcbb.2009.27] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 05/30/2023]
Abstract
Large amounts of population-scale genetic variation data are being collected. One potentially important biological problem is to infer the population genealogical history from these genetic variation data. Partly due to recombination, the genealogical history of a set of DNA sequences in a population usually cannot be represented by a single tree. Instead, the genealogy is better represented by a genealogical network, which is a compact representation of a set of correlated local genealogical trees, each for a short region of the genome and possibly with a different topology. Inference of genealogical history for a set of DNA sequences under recombination has many potential applications, including association mapping of complex diseases. In this paper, we present two new methods for reconstructing local tree topologies in the presence of recombination, which extend and improve on previous work. We first show that the "tree scan" method can be converted to a probabilistic inference method based on a hidden Markov model. We then focus on developing a novel local tree inference method called RENT that is both accurate and scalable to larger data. Through simulation, we demonstrate the usefulness of our methods by showing that the hidden-Markov-model-based method is comparable with the original method in terms of accuracy. We also show that RENT is competitive with other methods in terms of inference accuracy: its inference error rate is often lower, and it can handle large data.
Affiliation(s)
- Yufeng Wu
- Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Road, Unit 2155, Storrs, CT 06269, USA.
|
16
|
Su SY, Asher JE, Jarvelin MR, Froguel P, Blakemore AIF, Balding DJ, Coin LJM. Inferring combined CNV/SNP haplotypes from genotype data. Bioinformatics 2010; 26:1437-45. [PMID: 20406911 DOI: 10.1093/bioinformatics/btq157] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 01/17/2023]
Abstract
MOTIVATION Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, the haplotypic phase of CNVs and SNPs is inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate. RESULTS We generated diploid phase-known CNV-SNP genotype datasets by pairing male X chromosome CNV-SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset, a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets. AVAILABILITY Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin.
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London W2 1PG, UK
|
17
|
Liu Z, Lin S, Tan MT. Sparse support vector machines with Lp penalty for biomarker identification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:100-107. [PMID: 20150672 DOI: 10.1109/tcbb.2008.17] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 05/28/2023]
Abstract
The development of high-throughput technology has generated a massive amount of high-dimensional data, much of which is of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms are only applicable to the continuous data type. In this paper, we propose a novel method for sparse support vector machines (SVMs) with L_p (p < 1) regularization. Efficient algorithms (LpSVM) are developed for learning a classifier that is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated by maximizing the area under the ROC curve (AUC) of the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. The software package in Matlab is available upon request.
Affiliation(s)
- Zhenqiu Liu
- Division of Biostatistics and Bioinformatics, Department of Epidemiology and Preventive Medicine, Greenebaum Cancer Center, School of Medicine, University of Maryland, Baltimore, MD 21201, USA.
|
18
|
Wang J, de Villena FPM, Moore KJ, Wang W, Zhang Q, McMillan L. Genome-wide compatible SNP intervals and their properties. THE 2010 ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND COMPUTATIONAL BIOLOGY : ACM-BCB 2010 : NIAGARA FALLS, NEW YORK, U.S.A., AUGUST 2-4, 2010. ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (1ST : 2010 :... 2010; 2010:43-52. [PMID: 29152612 DOI: 10.1145/1854776.1854788] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 10/19/2022]
Abstract
Intraspecific genomes can be subdivided into blocks with limited diversity. Understanding the distribution and structure of these blocks will help to unravel many biological problems including the identification of genes associated with complex diseases, finding the ancestral origins of a given population, and localizing regions of historical recombination, gene conversion, and homoplasy. We present methods for partitioning a genome into blocks for which there are no apparent recombinations, thus providing parsimonious sets of compatible genome intervals based on the four-gamete test. Our contribution is a thorough analysis of the problem of dividing a genome into compatible intervals, in terms of its computational complexity, and by providing an achievable lower-bound on the minimal number of intervals required to cover an entire data set. In general, such minimal interval partitions are not unique. However, we identify properties that are common to every possible solution. We also define the notion of an interval set that achieves the interval lower-bound, yet maximizes interval overlap. We demonstrate algorithms for partitioning both haplotype data from inbred mice as well as outbred heterozygous genotype data using extensions of the standard four-gamete test. These methods allow our algorithms to be applied to a wide range of genomic data sets.
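The four-gamete test mentioned in this abstract can be illustrated with a minimal sketch. The code below is not the authors' algorithm: it is a simple left-to-right greedy scan that produces one non-overlapping partition into compatible intervals, and it does not attempt the overlap-maximizing interval sets or the genotype extensions the abstract describes. The function names `compatible` and `greedy_compatible_intervals` and the 0/1 haplotype encoding are assumptions made for illustration.

```python
def compatible(site_a, site_b):
    """Four-gamete test for two biallelic sites.

    The sites are compatible (no apparent recombination between them)
    if fewer than four distinct allele pairs (gametes) are observed.
    """
    return len(set(zip(site_a, site_b))) < 4


def greedy_compatible_intervals(haplotypes):
    """Partition SNP columns into consecutive pairwise-compatible intervals.

    haplotypes: list of equal-length sequences (e.g. strings of '0'/'1'),
    one per haplotype. Returns a list of inclusive (start, end) SNP indices.
    Hypothetical helper: a greedy scan, not the published method.
    """
    if not haplotypes or len(haplotypes[0]) == 0:
        return []
    n_sites = len(haplotypes[0])
    # Transpose so that each entry holds the alleles at one SNP.
    columns = [tuple(h[j] for h in haplotypes) for j in range(n_sites)]
    intervals = []
    start = 0
    for j in range(1, n_sites):
        # Close the current interval when SNP j conflicts with any SNP in it.
        if any(not compatible(columns[i], columns[j]) for i in range(start, j)):
            intervals.append((start, j - 1))
            start = j
    intervals.append((start, n_sites - 1))
    return intervals


if __name__ == "__main__":
    # SNPs 0 and 1 show all four gametes, so a break falls between them.
    print(greedy_compatible_intervals(["000", "010", "100", "111"]))
    # -> [(0, 0), (1, 2)]
```

Each interval returned contains only SNPs that pass the pairwise test, so every interval admits a perfect phylogeny under the usual infinite-sites assumption.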
Affiliation(s)
- Jeremy Wang
- Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA
- Kyle J Moore
- Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA
- Wei Wang
- Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA
- Qi Zhang
- Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA
- Leonard McMillan
- Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA
|
19
|
Schierup MH, Mailund T, Li H, Wang J, Tjønneland A, Vogel U, Bolund L, Nexø BA. Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups. BMC MEDICAL GENETICS 2009; 10:20. [PMID: 19257887 PMCID: PMC2654437 DOI: 10.1186/1471-2350-10-20] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 10/05/2008] [Accepted: 03/03/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes, of which 3 (ERCC1, ERCC2, and PPP1R13L, aka RAI) are related to DNA repair and cell survival, and one (CD3EAP, aka ASE1) may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents, and markers in it are associated with risk of several cancers. METHODS We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk. RESULTS We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping. CONCLUSION The combined observation of a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans, caused either by genetic drift overruling selection against the susceptibility variant or by positive selection for the same haplotype. The data do not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
Affiliation(s)
- Mikkel H Schierup
- Institute of Human Genetics, University of Aarhus, DK-8000 Aarhus C, Denmark.
|
20
|
Ledur MC, Navarro N, Pérez-Enciso M. Data modeling as a main source of discrepancies in single and multiple marker association methods. BMC Proc 2009; 3 Suppl 1:S9. [PMID: 19278548 PMCID: PMC2654503 DOI: 10.1186/1753-6561-3-s1-s9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022] Open
Abstract
Background Genome-wide association studies have successfully identified several loci underlying complex diseases in humans. The development of high density SNP maps in domestic animal species should allow the detection of QTLs for economically important traits through association studies with much higher accuracy than traditional linkage analysis. Here we report the association analysis of the dataset simulated for the XII QTL-MAS meeting (Uppsala). We used two strategies, single marker association (SMA) and haplotype-based association (Blossoc), which were applied to i) the raw data, and ii) the data corrected for infinitesimal, sex and generation effects. Results Both methods performed similarly in detecting the most strongly associated SNPs, about ten loci in total. The most significant ones were located on chromosomes 1, 4 and 5. Overall, the largest differences were found between corrected and raw data, rather than between single and multiple marker analysis. The use of raw data greatly increased the number of significant loci, but possibly also the rate of false positives. Bootstrap model aggregation removed most of the discrepancies between adjusted and raw data when SMA was employed. Conclusion Model choice should be carefully considered in genome-wide association studies.
Affiliation(s)
- Mônica Corrêa Ledur
- Embrapa Suínos e Aves, BR 153, Km 110, 89700-000, Concórdia, SC, Brazil; Dept. Ciencia Animal i dels Aliments, Facultat de Veterinaria, Universitat Autonoma de Barcelona, 08193, Bellaterra, Spain
- Nicolas Navarro
- Dept. Ciencia Animal i dels Aliments, Facultat de Veterinaria, Universitat Autonoma de Barcelona, 08193, Bellaterra, Spain
- Miguel Pérez-Enciso
- Dept. Ciencia Animal i dels Aliments, Facultat de Veterinaria, Universitat Autonoma de Barcelona, 08193, Bellaterra, Spain; Institut Català de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, 08010 Barcelona, Spain
|
21
|
Crooks L, Sahana G, de Koning DJ, Lund MS, Carlborg O. Comparison of analyses of the QTLMAS XII common dataset. II: genome-wide association and fine mapping. BMC Proc 2009; 3 Suppl 1:S2. [PMID: 19278541 PMCID: PMC2654496 DOI: 10.1186/1753-6561-3-s1-s2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022] Open
Abstract
As part of the QTLMAS XII workshop, a simulated dataset was distributed and participants were invited to submit analyses of the data based on genome-wide association, fine mapping and genomic selection. We have evaluated the findings from the groups that reported fine mapping and genome-wide association (GWA) efforts to map quantitative trait loci (QTL). Generally the power to detect QTL was high and the Type I error was low. Estimates of QTL locations were generally very accurate. Some methods were much better than others at estimating QTL effects, and with some the accuracy depended on simulated effect size or minor allele frequency. There were also indications of bias in the effect estimates. No epistasis was simulated, but the two studies that included searches for epistasis reported several interacting loci, indicating a problem with controlling the Type I error rate in these analyses. Although this study is based on a single dataset, it indicates that there is a need to improve fine mapping and GWA methods with respect to estimation of genetic effects, appropriate choice of significance thresholds and analysis of epistasis.
Affiliation(s)
- Lucy Crooks
- Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, SE-75007 Uppsala, Sweden.
|
22
|
Besenbacher S, Pedersen CNS, Mailund T. A fast algorithm for genome-wide haplotype pattern mining. BMC Bioinformatics 2009; 10 Suppl 1:S74. [PMID: 19208179 PMCID: PMC2648728 DOI: 10.1186/1471-2105-10-s1-s74] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/14/2023] Open
Abstract
Background Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a disease. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. Results We present a new, faster algorithm for the HPM method. The new approach uses patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method by a factor of 2 on a genome-wide dataset with 5009 individuals typed at 491208 markers using default parameters, and by more if the pattern length is increased. Conclusion The new algorithm speeds up the HPM method, and we show that it is feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
|
23
|
Pan F, McMillan L, de Villena FPM, Threadgill D, Wang W. TreeQA: quantitative genome wide association mapping using local perfect phylogeny trees. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009:415-426. [PMID: 19209719 PMCID: PMC2739990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
The goal of genome wide association (GWA) mapping in modern genetics is to identify genes or narrow regions in the genome that contribute to genetically complex phenotypes such as morphology or disease. Among the existing methods, tree-based association mapping methods show obvious advantages over single marker-based and haplotype-based methods because they incorporate information about the evolutionary history of the genome into the analysis. However, existing tree-based methods are designed primarily for binary phenotypes derived from case/control studies or fail to scale genome-wide. In this paper, we introduce TreeQA, a quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogenies constructed in genomic regions exhibiting no evidence of historical recombination. Through efficient algorithm design and implementation, TreeQA can conduct quantitative genome-wide association analysis efficiently and is more effective than the previous methods. We conducted extensive experiments on both simulated datasets and mouse inbred lines to demonstrate the efficiency and effectiveness of TreeQA.
Affiliation(s)
- Feng Pan
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Leonard McMillan
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- David Threadgill
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Wei Wang
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
24
|
Local phylogeny mapping of quantitative traits: higher accuracy and better ranking than single-marker association in genomewide scans. Genetics 2008; 181:747-53. [PMID: 19064712 DOI: 10.1534/genetics.108.092643] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023] Open
Abstract
We present a new method, termed QBlossoc, for linkage disequilibrium (LD) mapping of genetic variants underlying a quantitative trait. The method uses principles similar to a previously published method, Blossoc, for LD mapping of case/control studies. The method builds local genealogies along the genome and looks for a significant clustering of quantitative trait values in these trees. We analyze its efficiency in terms of localization and ranking of true positives among a large number of negatives and compare the results with single-marker approaches. Simulation results of markers at densities comparable to contemporary genotype chips show that QBlossoc is more accurate in localization of true positives as expected since it uses the additional information of LD between markers simultaneously. More importantly, however, for genomewide surveys, QBlossoc places regions with true positives higher on a ranked list than single-marker approaches, again suggesting that a true signal displays itself more strongly in a set of adjacent markers than a spurious (false) signal. The method is both memory and central processing unit (CPU) efficient. It has been tested on a real data set of height data for 5000 individuals measured at approximately 317,000 markers and completed analysis within 5 CPU days.
|
25
|
Nielsen J, Mailund T. SNPFile--a software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008; 9:526. [PMID: 19063732 PMCID: PMC2633306 DOI: 10.1186/1471-2105-9-526] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/22/2008] [Accepted: 12/08/2008] [Indexed: 11/16/2022] Open
Abstract
Background High-throughput genotyping technology has enabled cost effective typing of thousands of individuals at hundreds of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions. Results We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods. Conclusion The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden of keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is somewhat alleviated by a scripting interface that makes it easy to write converters to and from the format.
Affiliation(s)
- Jesper Nielsen
- Bioinformatics Research Center, University of Aarhus, Denmark.
|
26
|
Su SY, White J, Balding DJ, Coin LJM. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 2008; 9:513. [PMID: 19046436 PMCID: PMC2647950 DOI: 10.1186/1471-2105-9-513] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 08/26/2008] [Accepted: 12/01/2008] [Indexed: 12/18/2022] Open
Abstract
Background The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome), and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV), arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM) and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences. Results In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate. Conclusion With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Public Health, Imperial College, London, W2 1PG, UK.
|
27
|
Ding Z, Mailund T, Song YS. Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Bioinformatics 2008; 24:2215-21. [PMID: 18667442 PMCID: PMC2553438 DOI: 10.1093/bioinformatics/btn406] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/01/2008] [Revised: 07/25/2008] [Accepted: 07/29/2008] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods, and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome. RESULTS In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets. AVAILABILITY The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
Affiliation(s)
- Zhihong Ding
- Department of Computer Science, University of California, Davis, USA
|