1
Hong MM, Froelicher D, Magner R, Popic V, Berger B, Cho H. Secure discovery of genetic relatives across large-scale and distributed genomic data sets. Genome Res 2024; 34:1312-1323. [PMID: 39111815 DOI: 10.1101/gr.279057.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 07/31/2024] [Indexed: 10/02/2024]
Abstract
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging owing to the burden of estimating kinship between all the pairs of individuals across data sets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us data sets. On a data set of 200,000 individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic data sets.
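The bucketing idea in this abstract (group individuals who are likely to be related, then compare only individuals in matching buckets across parties) can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions, not SF-Relate's actual hash function or protocol: haplotypes are assumed to be equal-length lists of 0/1 alleles, the bucket key is simply the content of a fixed window, and the names `window_keys`, `build_buckets`, and `candidate_pairs` are hypothetical. The encrypted (MHE) kinship computation is omitted entirely.

```python
from collections import defaultdict


def window_keys(haplotype, window, step):
    """Yield one (start, alleles) key per window of the haplotype.

    Two haplotypes that are identical across a window (as they would be
    inside a shared IBD segment) produce the same key for that window.
    """
    for start in range(0, len(haplotype) - window + 1, step):
        yield (start, tuple(haplotype[start:start + window]))


def build_buckets(cohort, window, step):
    """Map each window key to the set of sample IDs that produced it."""
    buckets = defaultdict(set)
    for sample_id, haplotype in cohort.items():
        for key in window_keys(haplotype, window, step):
            buckets[key].add(sample_id)
    return buckets


def candidate_pairs(cohort_a, cohort_b, window=4, step=4):
    """Return cross-party pairs that share at least one bucket key.

    Only these pairs would be passed on to a (secure) kinship test,
    instead of all len(cohort_a) * len(cohort_b) pairs.
    """
    buckets_a = build_buckets(cohort_a, window, step)
    buckets_b = build_buckets(cohort_b, window, step)
    pairs = set()
    for key in buckets_a.keys() & buckets_b.keys():
        for a in buckets_a[key]:
            for b in buckets_b[key]:
                pairs.add((a, b))
    return pairs


# Toy data (hypothetical): a1 and b1 share the first four alleles.
party_a = {"a1": [0, 1, 1, 0, 1, 1, 0, 0], "a2": [1, 1, 1, 1, 0, 0, 0, 0]}
party_b = {"b1": [0, 1, 1, 0, 0, 0, 1, 1], "b2": [0, 0, 0, 0, 1, 0, 1, 0]}
pairs = candidate_pairs(party_a, party_b)
```

In this toy input only one of the four possible cross-party pairs is kept as a candidate, which is the kind of reduction the abstract describes.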
Affiliation(s)
- Matthew M Hong
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- David Froelicher
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  - Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA
- Ricky Magner
  - Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA
- Victoria Popic
  - Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  - Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Hyunghoon Cho
  - Department of Biomedical Informatics and Data Science, Yale University, New Haven, Connecticut 06510, USA
2
Temple SD, Waples RK, Browning SR. Modeling recent positive selection using identity-by-descent segments. Am J Hum Genet 2024:S0002-9297(24)00333-1. [PMID: 39362217 DOI: 10.1016/j.ajhg.2024.08.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 10/05/2024] Open
Abstract
Recent positive selection can result in an excess of long identity-by-descent (IBD) haplotype segments overlapping a locus. The statistical methods that we propose here address three major objectives in studying selective sweeps: scanning for regions of interest, identifying possible sweeping alleles, and estimating a selection coefficient s. First, we implement a selection scan to locate regions with excess IBD rates. Second, we estimate the allele frequency and location of an unknown sweeping allele by aggregating over variants that are more abundant in an inferred outgroup with excess IBD rate versus the rest of the sample. Third, we propose an estimator for the selection coefficient and quantify uncertainty using the parametric bootstrap. Comparing against state-of-the-art methods in extensive simulations, we show that our methods are more precise at estimating s when s≥0.015. We also show that our 95% confidence intervals contain s in nearly 95% of our simulations. We apply these methods to study positive selection in European ancestry samples from the Trans-Omics for Precision Medicine project. We analyze eight loci where IBD rates are more than four standard deviations above the genome-wide median, including LCT where the maximum IBD rate is 35 standard deviations above the genome-wide median. Overall, we present robust and accurate approaches to study recent adaptive evolution without knowing the identity of the causal allele or using time series data.
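The scan criterion mentioned in this abstract (loci whose IBD rate lies more than four standard deviations above the genome-wide median) can be sketched as a simple threshold rule. This is an illustrative sketch only: the function name `scan_excess_ibd` is hypothetical, the per-locus rates are assumed to be precomputed, and the plain population standard deviation is an assumption, since the abstract does not say how the authors estimate the spread.

```python
import statistics


def scan_excess_ibd(rates, n_sd=4.0):
    """Return indices of loci whose IBD rate exceeds median + n_sd * SD.

    `rates` is a sequence of per-locus IBD rates. The SD here is the
    population standard deviation of all rates (an assumption of this
    sketch, not necessarily the estimator used in the paper).
    """
    median = statistics.median(rates)
    sd = statistics.pstdev(rates)
    threshold = median + n_sd * sd
    return [i for i, rate in enumerate(rates) if rate > threshold]


# Toy genome (hypothetical): 99 background loci and one strong outlier.
toy_rates = [1.0] * 99 + [50.0]
flagged = scan_excess_ibd(toy_rates)
```

With a completely flat rate profile the standard deviation is zero and no locus exceeds the threshold, so the function returns an empty list.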
Affiliation(s)
- Seth D Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA
- Ryan K Waples
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
3
Lancaster MC, Chen HH, Shoemaker MB, Fleming MR, Strickland TL, Baker JT, Evans GF, Polikowsky HG, Samuels DC, Huff CD, Roden DM, Below JE. Detection of distant relatedness in biobanks to identify undiagnosed cases of Mendelian disease as applied to Long QT syndrome. Nat Commun 2024; 15:7507. [PMID: 39209900 PMCID: PMC11362435 DOI: 10.1038/s41467-024-51977-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 08/21/2024] [Indexed: 09/04/2024] Open
Abstract
Rare genetic diseases are typically studied in referral populations, resulting in underdiagnosis and biased assessment of penetrance and phenotype. To address this, we develop a generalizable method of genotype inference based on distant relatedness and deploy this to identify undiagnosed Type 5 Long QT Syndrome (LQT5) rare variant carriers in a non-referral population. We identify 9 LQT5 families referred to a single specialty clinic, each carrying p.Asp76Asn, the most common LQT5 variant. We uncover recent common ancestry and a single shared haplotype among probands. Application to a non-referral population of 69,819 BioVU biobank subjects identifies 22 additional subjects sharing this haplotype, which we confirm to carry p.Asp76Asn. Referral and non-referral carriers have prolonged QT interval corrected for heart rate (QTc) compared to controls, and, among carriers, the QTc polygenic score is independently associated with QTc prolongation. Thus, our innovative analysis of shared chromosomal segments identifies undiagnosed cases of genetic disease and refines the understanding of LQT5 penetrance and phenotype.
Affiliation(s)
- Megan C Lancaster
  - Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Hung-Hsin Chen
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, 11524, Taiwan
- M Benjamin Shoemaker
  - Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Matthew R Fleming
  - Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Teresa L Strickland
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- James T Baker
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Grahame F Evans
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Hannah G Polikowsky
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- David C Samuels
  - Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, 37232, USA
- Chad D Huff
  - Division of Cancer Prevention and Population Sciences, Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Dan M Roden
  - Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Jennifer E Below
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
4
Guo B, Takala-Harrison S, O’Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes. bioRxiv 2024:2024.05.04.592538. [PMID: 38746392 PMCID: PMC11092787 DOI: 10.1101/2024.05.04.592538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Genomic surveillance is crucial for identifying at-risk populations for targeted malaria control and elimination. Identity-by-descent (IBD) is increasingly being used in Plasmodium population genomics to estimate genetic relatedness, effective population size (Ne), population structure, and signals of positive selection. Despite its potential, a thorough evaluation of IBD segment detection tools for species with high recombination rates, such as P. falciparum, remains absent. Here, we perform comprehensive benchmarking of IBD callers - probabilistic (hmmIBD, isoRelate), identity-by-state-based (hap-IBD, phased IBD) and others (Refined IBD) - using population genetic simulations tailored for high recombination, and IBD quality metrics at both the IBD segment level and the IBD-based downstream inference level. Our results demonstrate that low marker density per genetic unit, related to high recombination relative to mutation, significantly compromises the accuracy of detected IBD segments. In genomes with high recombination rates resembling P. falciparum, most IBD callers exhibit high false negative rates for shorter IBD segments, which can be partially mitigated through optimization of IBD caller parameters, especially those related to marker density. Notably, IBD detected with optimized parameters allows for more accurate capture of selection signals and population structure; IBD-based Ne inference is very sensitive to IBD detection errors, with IBD called from hmmIBD uniquely providing less biased estimates of Ne in this context. Validation with empirical data from the MalariaGEN Pf7 database, representing different transmission settings, corroborates these findings. We conclude that context-specific evaluation and parameter optimization are essential for accurate IBD detection in high-recombining species and recommend hmmIBD for quality-sensitive analysis, such as estimation of Ne in these species.
Our optimization and high-level benchmarking methods not only improve IBD segment detection in high-recombining genomes but also enhance overall genomic analysis, paving the way for more accurate genomic surveillance and targeted intervention strategies for malaria.
Affiliation(s)
- Bing Guo
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Shannon Takala-Harrison
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Timothy D. O’Connor
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
5
Temple SD, Thompson EA. Identity-by-descent segments in large samples. bioRxiv 2024:2024.06.05.597656. [PMID: 38895476 PMCID: PMC11185678 DOI: 10.1101/2024.06.05.597656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
If two haplotypes share the same alleles for an extended gene tract, these haplotypes are likely to derive identical-by-descent from a recent common ancestor. Identity-by-descent segment lengths are correlated via unobserved tree and recombination processes, which commonly presents challenges to the derivation of theoretical results in population genetics. Under interpretable regularity conditions, we show that the proportion of detectable identity-by-descent segments at a locus is normally distributed for large sample size and large scaled population size. We use efficient and exact simulations to study the distributional behavior of the detectable identity-by-descent rate in finite samples. One consequence of non-normality in finite samples is that genome-wide scans based on identity-by-descent rates may be subject to anti-conservative Type 1 error control.
Highlights
- We show the asymptotic normality of the identity-by-descent rate, a mean of correlated binary random variables that arises in population genetics studies.
- We describe an efficient algorithm capable of simulating long identity-by-descent segments around a locus in large sample sizes.
- In enormous simulation studies, we use this algorithm to characterize the distributional properties of the identity-by-descent rate.
- In finite samples, we reject the null hypothesis of normality more often than the nominal significance level, indicating that genome-wide scans based on identity-by-descent rates may be anti-conservative.
6
Stoneman HR, Price A, Trout NS, Lamont R, Tifour S, Pozdeyev N, Crooks K, Lin M, Rafaels N, Gignoux CR, Marker KM, Hendricks AE. Characterizing substructure via mixture modeling in large-scale genetic summary statistics. bioRxiv 2024:2024.01.29.577805. [PMID: 38766180 PMCID: PMC11100604 DOI: 10.1101/2024.01.29.577805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Genetic summary data are broadly accessible and highly useful, including for risk prediction, causal inference, fine mapping, and incorporation of external controls. However, collapsing individual-level data into groups masks intra- and inter-sample heterogeneity, leading to confounding, reduced power, and bias. Ultimately, unaccounted substructure limits summary data usability, especially for understudied or admixed populations. Here, we present Summix2, a comprehensive set of methods and software based on a computationally efficient mixture model to estimate and adjust for substructure in genetic summary data. In extensive simulations and application to public data, Summix2 characterizes finer-scale population structure, identifies ascertainment bias, and identifies potential regions of selection due to local substructure deviation. Summix2 increases the robust use of diverse publicly available summary data, resulting in improved and more equitable research.
Affiliation(s)
- Hayley R Stoneman
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Adelle Price
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikole Scribner Trout
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Riley Lamont
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Souha Tifour
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikita Pozdeyev
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Division of Endocrinology, Diabetes and Metabolism, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Kristy Crooks
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Department of Pathology, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Meng Lin
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Nicholas Rafaels
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Christopher R Gignoux
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Katie M Marker
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Audrey E Hendricks
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
7
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024] Open
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
8
Hong MM, Froelicher D, Magner R, Popic V, Berger B, Cho H. Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2024; 14758:308-313. [PMID: 39027313 PMCID: PMC11257153 DOI: 10.1007/978-1-0716-3989-4_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the significant burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 94.9% of third-degree relatives, and 99.9% of second-degree or closer relatives, within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.
Affiliation(s)
- David Froelicher
  - MIT, Cambridge, 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, 02142, USA
- Ricky Magner
  - Broad Institute of MIT and Harvard, Cambridge, 02142, USA
- Victoria Popic
  - Broad Institute of MIT and Harvard, Cambridge, 02142, USA
- Bonnie Berger
  - MIT, Cambridge, 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, 02142, USA
9
Ji Q, Yao Y, Li Z, Zhou Z, Qian J, Tang Q, Xie J. Characterizing identity by descent segments in Chinese interpopulation unrelated individual pairs. Mol Genet Genomics 2024; 299:37. [PMID: 38494535 DOI: 10.1007/s00438-024-02132-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2023] [Accepted: 02/22/2024] [Indexed: 03/19/2024]
Abstract
Identity by descent (IBD) segments, uninterrupted DNA segments derived from the same ancestral chromosomes, are widely used as indicators of relationships in genetics. A great deal of research focuses on IBD segments between related pairs, while statistical analyses of segments in unrelated individuals are rare. In this study, we investigated the basic informative features of IBD segments in unrelated pairs in Chinese populations from the 1000 Genomes Project. A total of 5922 IBD segments in Chinese interpopulation unrelated individual pairs were detected via IBIS, and the average segment length was 3.71 Mb. It was found that 17.86% of unrelated pairs shared at least one IBD segment in the Chinese cohort. Furthermore, a total of 49 chromosomal regions where IBD segments clustered in high abundance were identified, which might be sharing hotspots in the human genome. Such regions could also be observed in other ancestry populations, which implies that similar IBD backgrounds also exist. Altogether, these results demonstrated the distribution of common background IBD segments, which helps improve the accuracy in pedigree studies based on IBD analysis.
Affiliation(s)
- Qiqi Ji
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
- Yining Yao
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
- Zhimin Li
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
- Zhihan Zhou
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
- Jinglei Qian
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
- Qiqun Tang
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fudan University, Shanghai, 200032, China
- Jianhui Xie
  - Department of Forensic Medicine, School of Basic Medical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai, 200032, China
10
Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genet 2023; 19:e1011057. [PMID: 38039339 PMCID: PMC10718418 DOI: 10.1371/journal.pgen.1011057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 12/13/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023] Open
Abstract
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Affiliation(s)
- Han Chen
  - Human Genetics Center, Department of Epidemiology, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Ardalan Naseri
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Degui Zhi
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
11
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
12
Caggiano C, Boudaie A, Shemirani R, Mefford J, Petter E, Chiu A, Ercelen D, He R, Tward D, Paul KC, Chang TS, Pasaniuc B, Kenny EE, Shortt JA, Gignoux CR, Balliu B, Arboleda VA, Belbin G, Zaitlen N. Disease risk and healthcare utilization among ancestrally diverse groups in the Los Angeles region. Nat Med 2023; 29:1845-1856. [PMID: 37464048 PMCID: PMC11121511 DOI: 10.1038/s41591-023-02425-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/12/2022] [Accepted: 05/30/2023] [Indexed: 07/20/2023]
Abstract
An individual's disease risk is affected by the populations that they belong to, due to shared genetics and environmental factors. The study of fine-scale populations in clinical care is important for identifying and reducing health disparities and for developing personalized interventions. To assess patterns of clinical diagnoses and healthcare utilization by fine-scale populations, we leveraged genetic data and electronic medical records from 35,968 patients as part of the UCLA ATLAS Community Health Initiative. We defined clusters of individuals using identity by descent, a form of genetic relatedness that utilizes shared genomic segments arising due to a common ancestor. In total, we identified 376 clusters, including clusters with patients of Afro-Caribbean, Puerto Rican, Lebanese Christian, Iranian Jewish and Gujarati ancestry. Our analysis uncovered 1,218 significant associations between disease diagnoses and clusters and 124 significant associations with specialty visits. We also examined the distribution of pathogenic alleles and found 189 significant alleles at elevated frequency in particular clusters, including many that are not regularly included in population screening efforts. Overall, this work progresses the understanding of health in understudied communities and can provide the foundation for further study into health inequities.
Affiliation(s)
- Christa Caggiano
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
- Ruhollah Shemirani
  - Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Joel Mefford
  - Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, Los Angeles, CA, USA
- Ella Petter
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Alec Chiu
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA, USA
- Defne Ercelen
  - Computational and Systems Biology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
- Rosemary He
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Daniel Tward
  - Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Kimberly C Paul
  - Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
- Timothy S Chang
  - Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
- Bogdan Pasaniuc
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Institute of Precision Health, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
- Eimear E Kenny
  - Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Jonathan A Shortt
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Division of Bioinformatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Christopher R Gignoux
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Division of Bioinformatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Valerie A Arboleda
  - Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
- Gillian Belbin
  - Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Noah Zaitlen
  - Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
|
13
|
Wei Y, Naseri A, Zhi D, Zhang S. RaPID-Query for fast identity by descent search and genealogical analysis. Bioinformatics 2023; 39:btad312. [PMID: 37166451 PMCID: PMC10244210 DOI: 10.1093/bioinformatics/btad312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 04/26/2023] [Accepted: 05/09/2023] [Indexed: 05/12/2023] Open
Abstract
MOTIVATION: Due to the rapid growth of the genetic database size, genealogical search, a process of inferring familial relatedness by identifying DNA matches, has become a viable approach to help individuals find missing family members or law enforcement agencies locate suspects. A fast and accurate method is needed to search an out-of-database individual against millions of individuals. Most existing approaches only offer all-versus-all within-panel matching. Some prototype algorithms offer one-versus-all queries from an out-of-panel individual, but they do not tolerate errors. RESULTS: A new method, random projection-based identity-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes. By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites. A single query against all UK Biobank autosomal chromosomes was completed within 2.76 seconds on average, with a minimum length of 7 cM and 700 markers. RaPID-Query achieved a 0.016 false negative rate and a 0.012 false positive rate simultaneously on a chromosome 20 sequencing panel having 86,265 sites. This is comparable to the state-of-the-art IBD detection methods TPBWT (out-of-sample) and Hap-IBD. The high-quality IBD segments yielded by RaPID-Query were able to distinguish up to the fourth degree of familial relatedness for a given individual pair, and the area under the receiver operating characteristic curve values are at least 97.28%. AVAILABILITY AND IMPLEMENTATION: The RaPID-Query program is available at https://github.com/ucfcbb/RaPID-Query.
Affiliation(s)
- Yuan Wei
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Ardalan Naseri
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
- Degui Zhi
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
- Shaojie Zhang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
|
14
|
Cybulski C, Zamani N, Kluźniak W, Milano L, Wokołorczyk D, Stempa K, Rudnicka H, Zhang S, Zadeh M, Huzarski T, Jakubowska A, Dębniak T, Lener M, Szwiec M, Domagała P, Samani AA, Narod S, Gronwald J, Masson JY, Lubiński J, Akbari MR. Variants in ATRIP are associated with breast cancer susceptibility in the Polish population and UK Biobank. Am J Hum Genet 2023; 110:648-662. [PMID: 36977412 PMCID: PMC10119148 DOI: 10.1016/j.ajhg.2023.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 03/02/2023] [Indexed: 03/29/2023] Open
Abstract
Several breast cancer susceptibility genes have been discovered, but more are likely to exist. To identify additional breast cancer susceptibility genes, we used the founder population of Poland and performed whole-exome sequencing on 510 women with familial breast cancer and 308 control subjects. We identified a rare mutation in ATRIP (GenBank: NM_130384.3: c.1152_1155del [p.Gly385Ter]) in two women with breast cancer. At the validation phase, we found this variant in 42/16,085 unselected Polish breast cancer-affected individuals and in 11/9,285 control subjects (OR = 2.14, 95% CI = 1.13-4.28, p = 0.02). By analyzing the sequence data of the UK Biobank study participants (450,000 individuals), we identified ATRIP loss-of-function variants among 13/15,643 breast cancer-affected individuals versus 40/157,943 control subjects (OR = 3.28, 95% CI = 1.76-6.14, p < 0.001). Immunohistochemistry and functional studies showed the ATRIP c.1152_1155del variant allele is weakly expressed compared to the wild-type allele, and truncated ATRIP fails to perform its normal function to prevent replicative stress. We showed that tumors of women with breast cancer who have a germline ATRIP mutation have loss of heterozygosity at the site of ATRIP mutation and genomic homologous recombination deficiency. ATRIP is a critical partner of ATR that binds to RPA coating single-stranded DNA at sites of stalled DNA replication forks. Proper activation of ATR-ATRIP elicits a DNA damage checkpoint crucial in regulating cellular responses to DNA replication stress. Based on our observations, we conclude ATRIP is a breast cancer susceptibility gene candidate linking DNA replication stress to breast cancer.
Affiliation(s)
- Cezary Cybulski
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Neda Zamani
  - Women's College Research Institute, Women's College Hospital, University of Toronto, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Wojciech Kluźniak
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Larissa Milano
  - Genome Stability Laboratory, CHU de Québec Research Center, Oncology Axis; Department of Molecular Biology, Medical Biochemistry and Pathology; Laval University Cancer Research Center, Québec City, QC, Canada
- Dominika Wokołorczyk
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Klaudia Stempa
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Helena Rudnicka
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Shiyu Zhang
  - Women's College Research Institute, Women's College Hospital, University of Toronto, Toronto, ON, Canada
- Maryam Zadeh
  - Women's College Research Institute, Women's College Hospital, University of Toronto, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Tomasz Huzarski
  - Department of Clinical Genetics and Pathology, University of Zielona Góra, Zielona Góra, Poland
- Anna Jakubowska
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland; Independent Laboratory of Molecular Biology and Genetic Diagnostics, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Tadeusz Dębniak
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Marcin Lener
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Marek Szwiec
  - Department of Surgery and Oncology, University of Zielona Góra, Zielona Góra, Poland
- Paweł Domagała
  - Department of Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Amir Abbas Samani
  - Department of Laboratory Medicine and Pathology, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Humber River Hospital, University of Toronto, Toronto, ON, Canada
- Steven Narod
  - Women's College Research Institute, Women's College Hospital, University of Toronto, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Jacek Gronwald
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Jean-Yves Masson
  - Genome Stability Laboratory, CHU de Québec Research Center, Oncology Axis; Department of Molecular Biology, Medical Biochemistry and Pathology; Laval University Cancer Research Center, Québec City, QC, Canada
- Jan Lubiński
  - International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University in Szczecin, Szczecin, Poland
- Mohammad R Akbari
  - Women's College Research Institute, Women's College Hospital, University of Toronto, Toronto, ON, Canada; Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
|
15
|
Kurant DE. Opportunities and Challenges with Artificial Intelligence in Genomics. Clin Lab Med 2023; 43:87-97. [PMID: 36764810 DOI: 10.1016/j.cll.2022.09.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
Abstract
The development of artificial intelligence and machine learning algorithms may allow for advances in patient care. There are existing and potential applications in cancer diagnosis and monitoring, identification of at-risk groups of individuals, classification of genetic variants, and even prediction of patient ancestry. This article provides an overview of some current and future applications of artificial intelligence in genomic medicine, in addition to discussing challenges and considerations when bringing these tools into clinical practice.
Affiliation(s)
- Danielle E Kurant
  - Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
|
16
|
Tang K, Naseri A, Wei Y, Zhang S, Zhi D. Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts. Gigascience 2022; 11:giac111. [PMID: 36472573 PMCID: PMC9724555 DOI: 10.1093/gigascience/giac111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 08/04/2022] [Accepted: 09/28/2022] [Indexed: 12/12/2022] Open
Abstract
In the recent biobank era of genetics, the problem of identical-by-descent (IBD) segment detection received renewed interest, as IBD segments in large cohorts offer unprecedented opportunities in the study of population and genealogical history, as well as genetic association of long haplotypes. While a new generation of efficient methods for IBD segment detection becomes available, direct comparison of these methods is difficult: existing benchmarks were often evaluated in different datasets, with some not openly accessible; methods benchmarked were run under suboptimal parameters; and benchmark performance metrics were not defined consistently. Here, we developed a comprehensive and completely open-source evaluation of the power, accuracy, and resource consumption of these IBD segment detection methods using realistic population genetic simulations with various settings. Our results pave the road for fair evaluation of IBD segment detection methods and provide a practical guide for users.
Affiliation(s)
- Kecong Tang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Ardalan Naseri
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Yuan Wei
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Shaojie Zhang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Degui Zhi
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
17
|
Balagué-Dobón L, Cáceres A, González JR. Fully exploiting SNP arrays: a systematic review on the tools to extract underlying genomic structure. Brief Bioinform 2022; 23:bbac043. [PMID: 35211719 PMCID: PMC8921734 DOI: 10.1093/bib/bbac043] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/19/2021] [Revised: 01/25/2022] [Accepted: 01/28/2022] [Indexed: 12/12/2022] Open
Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant type of genomic variation and the most accessible to genotype in large cohorts. However, they individually explain a small proportion of phenotypic differences between individuals. Ancestry, collective SNP effects, structural variants, somatic mutations or even differences in historic recombination can potentially explain a high percentage of genomic divergence. These genetic differences can be infrequent or laborious to characterize; however, many of them leave distinctive marks on the SNPs across the genome allowing their study in large population samples. Consequently, several methods have been developed over the last decade to detect and analyze different genomic structures using SNP arrays, to complement genome-wide association studies and determine the contribution of these structures to explain the phenotypic differences between individuals. We present an up-to-date collection of available bioinformatics tools that can be used to extract relevant genomic information from SNP array data including population structure and ancestry; polygenic risk scores; identity-by-descent fragments; linkage disequilibrium; heritability and structural variants such as inversions, copy number variants, genetic mosaicisms and recombination histories. From a systematic review of recently published applications of the methods, we describe the main characteristics of R packages, command-line tools and desktop applications, both free and commercial, to help make the most of a large amount of publicly available SNP data.
|
18
|
Sticca EL, Belbin GM, Gignoux CR. Current Developments in Detection of Identity-by-Descent Methods and Applications. Front Genet 2021; 12:722602. [PMID: 34567074 PMCID: PMC8461052 DOI: 10.3389/fgene.2021.722602] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 06/09/2021] [Accepted: 08/24/2021] [Indexed: 01/23/2023] Open
Abstract
Identity-by-descent (IBD), the detection of shared segments inherited from a common ancestor, is a fundamental concept in genomics with broad applications in the characterization and analysis of genomes. While historically the concept of IBD was extensively utilized through linkage analyses and in studies of founder populations, applications of IBD-based methods subsided during the genome-wide association study era. This was primarily due to the computational expense of IBD detection, which becomes increasingly relevant as the field moves toward the analysis of biobank-scale datasets that encompass individuals from highly diverse backgrounds. To address these computational barriers, the past several years have seen new methodological advances enabling IBD detection for datasets in the hundreds of thousands to millions of individuals, enabling novel analyses at an unprecedented scale. Here, we describe the latest innovations in IBD detection and describe opportunities for the application of IBD-based methods across a broad range of questions in the field of genomics.
Affiliation(s)
- Evan L Sticca
  - Human Medical Genetics and Genomics Program and Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Gillian M Belbin
  - Institute for Genomic Health, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Christopher R Gignoux
  - Human Medical Genetics and Genomics Program and Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
|
19
|
Kivisild T, Saag L, Hui R, Biagini SA, Pankratov V, D'Atanasio E, Pagani L, Saag L, Rootsi S, Mägi R, Metspalu E, Valk H, Malve M, Irdt K, Reisberg T, Solnik A, Scheib CL, Seidman DN, Williams AL, Tambets K, Metspalu M. Patterns of genetic connectedness between modern and medieval Estonian genomes reveal the origins of a major ancestry component of the Finnish population. Am J Hum Genet 2021; 108:1792-1806. [PMID: 34411538 DOI: 10.1016/j.ajhg.2021.07.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/29/2021] [Accepted: 07/23/2021] [Indexed: 11/20/2022] Open
Abstract
The Finnish population is a unique example of a genetic isolate affected by a recent founder event. Previous studies have suggested that the ancestors of Finnic-speaking Finns and Estonians reached the circum-Baltic region by the 1st millennium BC. However, high linguistic similarity points to a more recent split of their languages. To study genetic connectedness between Finns and Estonians directly, we first assessed the efficacy of imputation of low-coverage ancient genomes by sequencing a medieval Estonian genome to high depth (23×) and evaluated the performance of its down-sampled replicas. We find that ancient genomes imputed from >0.1× coverage can be reliably used in principal-component analyses without projection. By searching for long shared allele intervals (LSAIs; similar to identity-by-descent segments) in unphased data for >143,000 present-day Estonians, 99 Finns, and 14 imputed ancient genomes from Estonia, we find unexpectedly high levels of individual connectedness between Estonians and Finns for the last eight centuries in contrast to their clear differentiation by allele frequencies. High levels of sharing of these segments between Estonians and Finns predate the demographic expansion and late settlement process of Finland. One plausible source of this extensive sharing is the 8th-10th centuries AD migration event from North Estonia to Finland that has been proposed to explain uniquely shared linguistic features between the Finnish language and the northern dialect of Estonian and shared Christianity-related loanwords from Slavic. These results suggest that LSAI detection provides a computationally tractable way to detect fine-scale structure in large cohorts.
Affiliation(s)
- Toomas Kivisild
  - Department of Human Genetics, KU Leuven, Leuven 3000, Belgium; Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia; McDonald Institute for Archaeological Research, University of Cambridge, Cambridge CB2 3ER, UK
- Lehti Saag
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia; Research Department of Genetics, Evolution, and Environment, University College London, London WC1E 6BT, UK
- Ruoyun Hui
  - McDonald Institute for Archaeological Research, University of Cambridge, Cambridge CB2 3ER, UK; The Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB, UK
- Vasili Pankratov
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Eugenia D'Atanasio
  - Instituto di Biologia e Patologia Molecolari, Consiglio Nazionale delle Ricerche, Rome, Italy
- Luca Pagani
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia; Department of Biology, University of Padova, 35131 Padova, Italy
- Lauri Saag
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Siiri Rootsi
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Reedik Mägi
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Ene Metspalu
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Heiki Valk
  - Department of Archaeology, Institute of History and Archaeology, University of Tartu, Tartu 51014, Estonia
- Martin Malve
  - Department of Archaeology, Institute of History and Archaeology, University of Tartu, Tartu 51014, Estonia
- Kadri Irdt
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Tuuli Reisberg
  - Core Facility, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Anu Solnik
  - Core Facility, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Christiana L Scheib
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia; McDonald Institute for Archaeological Research, University of Cambridge, Cambridge CB2 3ER, UK; St John's College, University of Cambridge, Cambridge CB2 1TP, UK
- Daniel N Seidman
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA
- Amy L Williams
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA
- Kristiina Tambets
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Mait Metspalu
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
|