1
van der Burg LLJ, de Wreede LC, Baldauf H, Sauter J, Schetelig J, Putter H, Böhringer S. Haplotype reconstruction for genetically complex regions with ambiguous genotype calls: Illustration by the KIR gene region. Genet Epidemiol 2024; 48:3-26. [PMID: 37830494 DOI: 10.1002/gepi.22538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 09/06/2023] [Accepted: 09/25/2023] [Indexed: 10/14/2023]
Abstract
Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet it is impossible to derive exact genotypes in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. An example of such a gene region is the killer-cell immunoglobulin-like receptor (KIR) genes. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation-maximization (EM)-algorithm to estimate haplotype frequencies (HTFs) which deals with the missing data components and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM-algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed into a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information of multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM-algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM-algorithm show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study.
Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.
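The abstract above extends a standard EM algorithm for haplotype frequencies. As a minimal sketch of that standard starting point only (not the authors' iterative, gene-by-gene method with rare-haplotype collapsing and profiling), the Python function below estimates haplotype frequencies for two biallelic loci from unphased genotypes, where only double heterozygotes are phase-ambiguous. The function name, the genotype coding (0, 1 or 2 copies of allele 1 per locus), and the default settings are illustrative assumptions.

```python
from itertools import product


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Standard EM for two biallelic loci (illustrative sketch).

    genotypes: list of (g1, g2), where g is the number of copies of
    allele 1 at each locus (0, 1 or 2). Returns a dict mapping each
    haplotype (a1, a2) to its estimated frequency.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            # E-step: all ordered haplotype pairs compatible with the genotype
            pairs = [(h1, h2) for h1, h2 in product(haps, repeat=2)
                     if h1[0] + h2[0] == g1 and h1[1] + h2[1] == g2]
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0.0:
                continue  # genotype incompatible with current estimates
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes
        new_freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq
```

Calling `em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])` illustrates how linkage disequilibrium in the unambiguous individuals resolves the phase of the double heterozygote: the estimates move towards the 0-0 and 1-1 haplotypes.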
Affiliation(s)
- Liesbeth C de Wreede
  - Biomedical Data Sciences, LUMC, Leiden, The Netherlands
  - DKMS, Dresden/Tübingen, Germany
- Johannes Schetelig
  - DKMS, Dresden/Tübingen, Germany
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, Dresden, Germany
- Hein Putter
  - Biomedical Data Sciences, LUMC, Leiden, The Netherlands
2
Brugger M, Lutz M, Müller-Nurasyid M, Lichtner P, Slater EP, Matthäi E, Bartsch DK, Strauch K. Joint Linkage and Association Analysis Using GENEHUNTER-MODSCORE with an Application to Familial Pancreatic Cancer. Hum Hered 2024; 89:8-31. [PMID: 38198765 DOI: 10.1159/000535840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 12/07/2023] [Indexed: 01/12/2024] Open
Abstract
INTRODUCTION Joint linkage and association (JLA) analysis combines two disease gene mapping strategies: linkage information contained in families and association information contained in populations. Such a JLA analysis can increase mapping power, especially when the evidence for both linkage and association is low to moderate. Similarly, an association analysis based on haplotypes instead of single markers can increase mapping power when the association pattern is complex. METHODS In this paper, we present an extension to the GENEHUNTER-MODSCORE software package that enables a JLA analysis based on haplotypes and uses information from arbitrary pedigree types and unrelated individuals. Our new JLA method is an extension of the MOD score approach for linkage analysis, which allows the estimation of trait-model and linkage disequilibrium (LD) parameters, i.e., penetrance, disease-allele frequency, and haplotype frequencies. LD is modeled between alleles at a single diallelic disease locus and up to three diallelic test markers. Linkage information is contributed by additional multi-allelic flanking markers. We investigated the statistical properties of our JLA implementation using extensive simulations, and we compared our approach to another commonly used single-marker JLA test. To demonstrate the applicability of our new method in practice, we analyzed pedigree data from the German National Case Collection for Familial Pancreatic Cancer (FaPaCa). RESULTS Based on the simulated data, we demonstrated the validity of our JLA-MOD score analysis implementation and identified scenarios in which haplotype-based tests outperformed the single-marker test. The estimated trait-model and LD parameters were in good accordance with the simulated values. Our method outperformed another commonly used JLA single-marker test when the LD pattern was complex. 
The exploratory analysis of the FaPaCa families led to the identification of a promising genetic region on chromosome 22q13.33, which can serve as a starting point for future mutation analysis and molecular research in pancreatic cancer. CONCLUSION Our newly proposed JLA-MOD score method proves to be a valuable gene mapping and characterization tool, especially when either linkage or association information alone provides insufficient power to identify the disease-causing genetic variants.
Affiliation(s)
- Markus Brugger
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
  - Institute of Medical Information Processing, Biometry and Epidemiology - IBE, LMU Munich, Munich, Germany
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Manuel Lutz
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
  - Institute of Medical Information Processing, Biometry and Epidemiology - IBE, LMU Munich, Munich, Germany
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Martina Müller-Nurasyid
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
  - Institute of Medical Information Processing, Biometry and Epidemiology - IBE, LMU Munich, Munich, Germany
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Peter Lichtner
  - Institute of Human Genetics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Emily P Slater
  - Department of Visceral, Thoracic and Vascular Surgery, Philipps University, Marburg, Germany
- Elvira Matthäi
  - Department of Visceral, Thoracic and Vascular Surgery, Philipps University, Marburg, Germany
- Detlef K Bartsch
  - Department of Visceral, Thoracic and Vascular Surgery, Philipps University, Marburg, Germany
- Konstantin Strauch
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
  - Institute of Medical Information Processing, Biometry and Epidemiology - IBE, LMU Munich, Munich, Germany
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
3
Genome-wide haplotype association study in imaging genetics using whole-brain sulcal openings of 16,304 UK Biobank subjects. Eur J Hum Genet 2021; 29:1424-1437. [PMID: 33664500 PMCID: PMC8440755 DOI: 10.1038/s41431-021-00827-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/17/2020] [Revised: 12/18/2020] [Accepted: 02/04/2021] [Indexed: 11/29/2022] Open
Abstract
Neuroimaging-genetics cohorts gather two types of data: brain imaging and genetic data. They allow the discovery of associations between genetic variants and brain imaging features. They are invaluable resources for studying the influence of genetics and environment on the variance of brain features observed in normal and pathological populations. This study presents a genome-wide haplotype analysis of 123 brain sulcal opening values (a measure of sulcal width) across the whole brain in 16,304 subjects from UK Biobank. Using genetic maps, we defined 119,548 blocks of low recombination rate distributed along the 22 autosomal chromosomes and analyzed 1,051,316 haplotypes. To test associations between haplotypes and complex traits, we designed three statistical approaches. Two of them use a model that includes all the haplotypes for a single block, while the last approach considers each haplotype independently. All the statistics produced were assessed as rigorously as possible. Thanks to the rich imaging dataset at hand, we used resampling techniques to assess the false positive rate for each statistical approach in a genome-wide and brain-wide context. The results on real data show that genome-wide haplotype analyses are more sensitive than the single-SNP approach and account for local complex linkage disequilibrium (LD) structure, which makes genome-wide haplotype analysis an interesting and statistically sound alternative to the single-SNP counterpart.
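The abstract does not specify the three test statistics. As a hedged illustration of the simplest idea it mentions, testing each haplotype independently, the sketch below regresses a quantitative trait on the number of copies (0, 1 or 2) of one haplotype using ordinary least squares. The function name and the choice of a plain OLS t-statistic are assumptions, not the study's actual procedure.

```python
import math


def single_haplotype_test(dosage, phenotype):
    """OLS regression of a phenotype on haplotype copy number (sketch).

    dosage: copies of one haplotype per subject (0, 1 or 2).
    phenotype: quantitative trait values, same length as dosage.
    Returns (slope, t_statistic) for the per-copy effect.
    Assumes the dosage varies across subjects and the fit is not perfect.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares with n - 2 degrees of freedom
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(dosage, phenotype))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err
```

In a genome-wide setting such a test would be repeated for every haplotype in every block, which is why the abstract stresses resampling-based control of the false positive rate.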
4
Balliu B, Houwing‐Duistermaat JJ, Böhringer S. Powerful testing via hierarchical linkage disequilibrium in haplotype association studies. Biom J 2019; 61:747-768. [PMID: 30693553 PMCID: PMC6637384 DOI: 10.1002/bimj.201800053] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/06/2018] [Revised: 08/09/2018] [Accepted: 09/08/2018] [Indexed: 12/03/2022]
Abstract
Marginal tests based on individual SNPs are routinely used in genetic association studies. Studies have shown that haplotype-based methods may provide more power in disease mapping than methods based on single markers when, for example, multiple disease-susceptibility variants occur within the same gene. A limitation of haplotype-based methods is that the number of parameters increases exponentially with the number of SNPs, inducing a commensurate increase in the degrees of freedom and weakening the power to detect associations. To address this limitation, we introduce a hierarchical linkage disequilibrium model for disease mapping, based on a reparametrization of the multinomial haplotype distribution, where every parameter corresponds to the cumulant of each possible subset of a set of loci. This hierarchy present in the parameters enables us to employ flexible testing strategies over a range of parameter sets: from standard single SNP analyses through the full haplotype distribution tests, reducing degrees of freedom and increasing the power to detect associations. We show via extensive simulations that our approach maintains the type I error at nominal level and has increased power under many realistic scenarios, as compared to single SNP and standard haplotype-based studies. To evaluate the performance of our proposed methodology in real data, we analyze genome-wide data from the Wellcome Trust Case-Control Consortium.
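To make the cumulant idea concrete, here is a sketch under the assumption that the parameters are the standard joint cumulants of allele indicator variables; the paper's exact parametrization may differ. Let $X_i \in \{0, 1\}$ indicate the presence of a reference allele at locus $i$ on a haplotype, with $p_i = \mathrm{E}[X_i]$ and $p_{12} = \mathrm{E}[X_1 X_2]$ (these symbols are introduced here for illustration). The first-order cumulant is the allele frequency, the second-order cumulant is the usual pairwise LD coefficient $D_{12}$, and the third-order cumulant captures three-locus dependence that is not explained by the pairwise terms:

```latex
\begin{align*}
\kappa_{1}   &= \mathrm{E}[X_1] = p_1, \\
\kappa_{12}  &= \mathrm{E}[X_1 X_2] - \mathrm{E}[X_1]\,\mathrm{E}[X_2]
              = p_{12} - p_1 p_2 = D_{12}, \\
\kappa_{123} &= \mathrm{E}[X_1 X_2 X_3]
              - \mathrm{E}[X_1 X_2]\,\mathrm{E}[X_3]
              - \mathrm{E}[X_1 X_3]\,\mathrm{E}[X_2]
              - \mathrm{E}[X_2 X_3]\,\mathrm{E}[X_1]
              + 2\,\mathrm{E}[X_1]\,\mathrm{E}[X_2]\,\mathrm{E}[X_3].
\end{align*}
```

Under this reading, restricting a test to first-order terms corresponds to single-SNP analysis, while including all orders corresponds to a test on the full haplotype distribution.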
Affiliation(s)
- Brunilda Balliu
  - Department of Biomathematics, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA
- Stefan Böhringer
  - Department of Biomedical Data Sciences, Section Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands