1
Finke K, Kourakos M, Brown G, Dang HT, Tan SJS, Simons YB, Ramdas S, Schäffer AA, Kember RL, Bućan M, Mathieson S. Ancestral haplotype reconstruction in endogamous populations using identity-by-descent. PLoS Comput Biol 2021; 17:e1008638. [PMID: 33635861 PMCID: PMC7946327 DOI: 10.1371/journal.pcbi.1008638] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/01/2020] [Revised: 03/10/2021] [Accepted: 12/15/2020] [Indexed: 12/24/2022]
Abstract
In this work we develop a novel algorithm for reconstructing the genomes of ancestral individuals, given genotype or sequence data from contemporary individuals and an extended pedigree of family relationships. A pedigree with complete genomes for every individual enables the study of allele frequency dynamics and haplotype diversity across generations, including deviations from neutrality such as transmission distortion. When studying heritable diseases, ancestral haplotypes can be used to augment genome-wide association studies and track disease inheritance patterns. The building blocks of our reconstruction algorithm are segments of Identity-By-Descent (IBD) shared between two or more genotyped individuals. The method alternates between identifying a source for each IBD segment and assembling IBD segments placed within each ancestral individual. Unlike previous approaches, our method is able to accommodate complex pedigree structures with hundreds of individuals genotyped at millions of SNPs. We apply our method to an Old Order Amish pedigree from Lancaster, Pennsylvania, whose founders came to North America from Europe during the early 18th century. The pedigree includes 1338 individuals from the past 12 generations, 394 with genotype data. The motivation for reconstruction is to understand the genetic basis of diseases segregating in the family through tracking haplotype transmission over time. Using our algorithm thread, we are able to reconstruct an average of 224 ancestral individuals per chromosome. For these ancestral individuals, on average we reconstruct 79% of their haplotypes. We also identify a region on chromosome 16 that is difficult to reconstruct—we find that this region harbors a short Amish-specific copy number variation and the gene HYDIN. thread was developed for endogamous populations, but can be applied to any extensive pedigree with the recent generations genotyped. 
We anticipate that this type of practical ancestral reconstruction will become more common and necessary to understand rare and complex heritable diseases in extended families.

When analyzing complex heritable traits, genomic data from many generations of an extended family increases the amount of information available for statistical inference. However, typically only genomic data from the recent generations of a pedigree are available, as ancestral individuals are deceased. In this work we present an algorithm, called thread, for reconstructing the genomes of ancestral individuals, given a complex pedigree and genomic data from the recent generations. Previous approaches have not been able to accommodate large datasets (both in terms of sites and individuals), made simplifying assumptions about pedigree structure, or did not tie reconstructed sequences back to specific individuals. We apply thread to a complex Old Order Amish pedigree of 1338 individuals, 394 with genotype data.
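The abstract describes identifying a source for each IBD segment within a known pedigree. As a rough illustration only, one natural set of candidate sources is the set of ancestors shared by every individual carrying the segment. The sketch below assumes a hypothetical pedigree encoding (child mapped to a father and mother pair) and hypothetical function names; it is not the thread implementation and ignores the case where one carrier is an ancestor of another.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical encoding: individual -> (father, mother); founders map to (None, None).
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def ancestors(pedigree: Pedigree, individual: str) -> Set[str]:
    """Return every ancestor of `individual`, excluding the individual."""
    found: Set[str] = set()
    stack = [individual]
    while stack:
        current = stack.pop()
        for parent in pedigree.get(current, (None, None)):
            if parent is not None and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def candidate_sources(pedigree: Pedigree, carriers: Iterable[str]) -> Set[str]:
    """Ancestors shared by all carriers of one IBD segment (assumed candidates)."""
    carrier_list = list(carriers)
    if not carrier_list:
        return set()
    shared = ancestors(pedigree, carrier_list[0])
    for carrier in carrier_list[1:]:
        shared &= ancestors(pedigree, carrier)
    return shared


# Toy pedigree: c1 and c2 are first cousins through grandparents gf and gm.
example_pedigree: Pedigree = {
    "gf": (None, None), "gm": (None, None),
    "x": (None, None), "y": (None, None),
    "p1": ("gf", "gm"), "p2": ("gf", "gm"),
    "c1": ("p1", "x"), "c2": ("p2", "y"),
}
print(candidate_sources(example_pedigree, ["c1", "c2"]))
```

In an endogamous pedigree the shared-ancestor set is typically large, which is presumably why a real method needs further criteria to choose among candidates.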
Affiliation(s)
- Kelly Finke
  - Department of Computer Science, Swarthmore College, Swarthmore, Pennsylvania, United States of America
  - Department of Biology, Swarthmore College, Swarthmore, Pennsylvania, United States of America
- Michael Kourakos
  - Department of Computer Science, Swarthmore College, Swarthmore, Pennsylvania, United States of America
- Gabriela Brown
  - Department of Computer Science, Swarthmore College, Swarthmore, Pennsylvania, United States of America
- Huyen Trang Dang
  - Department of Computer Science, Bryn Mawr College, Bryn Mawr, Pennsylvania, United States of America
- Shi Jie Samuel Tan
  - Department of Computer Science, Haverford College, Haverford, Pennsylvania, United States of America
- Yuval B. Simons
  - Department of Genetics, Stanford University, Stanford, California, United States of America
- Shweta Ramdas
  - Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Alejandro A. Schäffer
  - Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Rachel L. Kember
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Maja Bućan
  - Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Sara Mathieson
  - Department of Computer Science, Haverford College, Haverford, Pennsylvania, United States of America
  - * E-mail:
2
Ko A, Nielsen R. Joint Estimation of Pedigrees and Effective Population Size Using Markov Chain Monte Carlo. Genetics 2019; 212:855-868. [PMID: 31123041 PMCID: PMC6614905 DOI: 10.1534/genetics.119.302280] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/06/2018] [Accepted: 05/16/2019] [Indexed: 12/31/2022]
Abstract
Pedigrees provide the genealogical relationships among individuals at a fine resolution and serve an important function in many areas of genetic studies. One such use of pedigree information is in the estimation of the short-term effective population size ($N_e$), which is of great relevance in fields such as conservation genetics. Despite the usefulness of pedigrees, however, they are often unknown and must be inferred from genetic data. In this study, we present a Bayesian method to jointly estimate pedigrees and $N_e$ from genetic markers using Markov Chain Monte Carlo. Our method supports analysis of a large number of markers and individuals within a single generation with the use of a composite likelihood, which significantly increases computational efficiency. We show, on simulated data, that our method is able to jointly estimate relationships up to first cousins and $N_e$ with high accuracy. We also apply the method to a real dataset of house sparrows to reconstruct their previously unreported pedigree.
Affiliation(s)
- Amy Ko
  - Department of Integrative Biology, University of California, Berkeley, California 94720
- Rasmus Nielsen
  - Department of Integrative Biology, University of California, Berkeley, California 94720
  - Department of Statistics, University of California, Berkeley, California 94720
  - Museum of Natural History, University of Copenhagen, 1123 Denmark
3
Mo SK, Ren ZL, Yang YR, Liu YC, Zhang JJ, Wu HJ, Li Z, Bo XC, Wang SQ, Yan JW, Ni M. A 472-SNP panel for pairwise kinship testing of second-degree relatives. Forensic Sci Int Genet 2018; 34:178-185. [PMID: 29510334 DOI: 10.1016/j.fsigen.2018.02.019] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 10/16/2017] [Revised: 02/22/2018] [Accepted: 02/25/2018] [Indexed: 10/17/2022]
Abstract
Kinship testing based on genetic markers, such as forensic short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs), has valuable practical applications. Paternity and first-degree relationships can be accurately identified with commonly used forensic STRs and reported SNP markers. However, second-degree and more distant relationships remain challenging. Although ∼10^5 to 10^6 SNPs can be used to estimate relatedness of higher degrees, genome-wide genotyping and analysis may be impractical for forensic use. With the rapid growth of human genome data sets, it is worthwhile to explore additional markers, especially SNPs, for kinship analysis. Here, we report an autosomal SNP panel consisting of 342 SNPs selected from >84 million SNPs and 131 SNPs from previous systems. We genotyped these SNPs in 136 Chinese individuals by multiplex amplicon massively parallel sequencing and performed pairwise gender-independent kinship testing. The specificity and sensitivity of these SNPs for distinguishing second-degree relatives from unrelated individuals were 99.9% and 100%, respectively, compared with 53.7% and 99.9% for 19 commonly used forensic STRs. Moreover, the specificity increased to 100% with the combined use of these STRs and SNPs. The 472-SNP panel could also greatly facilitate discrimination among different relationships. We estimated that the power of ∼6.45 SNPs was equivalent to that of one forensic STR in the second-degree relative scenario. Altogether, we propose a panel of 472 SNP markers for kinship analysis, which could be an important supplement to current forensic STRs for second-degree relative testing.
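For readers less familiar with the two metrics quoted above, the following minimal sketch shows how sensitivity and specificity are computed from pairwise classification counts. It assumes that a "positive" is a true second-degree pair and a "negative" is an unrelated pair; the function name and the counts are hypothetical and are not taken from the study.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: second-degree pairs called second-degree
    fn: second-degree pairs called unrelated
    tn: unrelated pairs called unrelated
    fp: unrelated pairs called second-degree
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative pair")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=99, fn=1, tn=90, fp=10)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A low specificity, such as the 53.7% reported for the STR set alone, means many unrelated pairs would be called second-degree relatives.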
Affiliation(s)
- Shao-Kang Mo
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China; Department of Reproductive Center, General Hospital of Lanzhou Military Region, Lanzhou 730050, China.
- Zi-Lin Ren
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
- Ya-Ran Yang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.
- Ya-Cheng Liu
  - Department of Genetics, Beijing Tongda Shoucheng Institute of Forensic Science, Beijing 100192, China.
- Jing-Jing Zhang
  - Department of Biotechnology, Beijing Center for Physical and Chemical Analysis, Beijing 100089, China.
- Hui-Juan Wu
  - Department of Biotechnology, Beijing Center for Physical and Chemical Analysis, Beijing 100089, China.
- Zhen Li
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
- Xiao-Chen Bo
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
- Sheng-Qi Wang
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
- Jiang-Wei Yan
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
- Ming Ni
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
4
Ko A, Nielsen R. Composite likelihood method for inferring local pedigrees. PLoS Genet 2017; 13:e1006963. [PMID: 28827797 PMCID: PMC5578687 DOI: 10.1371/journal.pgen.1006963] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 01/10/2017] [Revised: 08/31/2017] [Accepted: 08/07/2017] [Indexed: 12/21/2022]
Abstract
Pedigrees contain information about the genealogical relationships among individuals and are of fundamental importance in many areas of genetic studies. However, pedigrees are often unknown and must be inferred from genetic data. Despite the importance of pedigree inference, existing methods are limited to inferring only close relationships or analyzing a small number of individuals or loci. We present a simulated annealing method for estimating pedigrees in large samples of otherwise seemingly unrelated individuals using genome-wide SNP data. The method supports complex pedigree structures such as polygamous families, multi-generational families, and pedigrees in which many of the member individuals are missing. Computational speed is greatly enhanced by the use of a composite likelihood function which approximates the full likelihood. We validate our method on simulated data and show that it can infer distant relatives more accurately than existing methods. Furthermore, we illustrate the utility of the method on a sample of Greenlandic Inuit.

Pedigrees contain information about the genealogical relationships among individuals. This information can be used in many areas of genetic studies such as disease association studies, conservation efforts, and for inferences about the demographic history and social structure of a population. Despite their importance, pedigrees are often unknown and must be estimated from genetic information. However, pedigree inference remains a difficult problem due to the high cost of likelihood computation and the enormous number of possible pedigrees that must be considered. These difficulties limit existing methods in their ability to infer pedigrees when the sample size or the number of markers is large, or when the sample contains only distant relatives. In this report, we present a method that circumvents these computational challenges in order to infer pedigrees of complex structure for a large number of individuals. Using simulations, we find that the method can infer distant relatives much more accurately than existing methods. Furthermore, we show that even pairwise inferences of relatedness can be improved substantially by consideration of the pedigree structure with other related individuals in the sample.
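The abstract mentions a composite likelihood that approximates the full pedigree likelihood. One common form of composite likelihood multiplies pairwise likelihoods, which is a sum in log space; whether this matches the paper's exact function is an assumption here. The sketch below scores two candidate pedigrees from precomputed pairwise log-likelihoods. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def composite_log_likelihood(
    pairwise_loglik: Dict[Pair, Dict[str, float]],
    implied_relationship: Dict[Pair, str],
) -> float:
    """Sum pairwise log-likelihoods under the relationships a pedigree implies.

    pairwise_loglik[pair][relationship] is the log-likelihood of the pair's
    genotype data given that relationship (assumed precomputed elsewhere).
    """
    return sum(
        pairwise_loglik[pair][implied_relationship[pair]]
        for pair in pairwise_loglik
    )


# Hypothetical log-likelihoods for two pairs of samples.
loglik = {
    ("A", "B"): {"full_sib": -10.0, "unrelated": -14.0},
    ("A", "C"): {"full_sib": -20.0, "unrelated": -12.0},
}
pedigree_1 = {("A", "B"): "full_sib", ("A", "C"): "unrelated"}
pedigree_2 = {("A", "B"): "unrelated", ("A", "C"): "unrelated"}

print(composite_log_likelihood(loglik, pedigree_1))
print(composite_log_likelihood(loglik, pedigree_2))
```

A search procedure such as simulated annealing would propose changes to the pedigree and compare scores of this kind, which is cheap because only the affected pairs need to be re-evaluated.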
Affiliation(s)
- Amy Ko
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America
  - * E-mail:
- Rasmus Nielsen
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America
  - Department of Statistics, University of California, Berkeley, Berkeley, California, United States of America
  - Museum of Natural History, University of Copenhagen, Copenhagen, Denmark
5
Heinrich V, Kamphans T, Mundlos S, Robinson PN, Krawitz PM. A likelihood ratio-based method to predict exact pedigrees for complex families from next-generation sequencing data. Bioinformatics 2017; 33:72-78. [PMID: 27565584 PMCID: PMC5408770 DOI: 10.1093/bioinformatics/btw550] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/29/2015] [Revised: 07/06/2016] [Accepted: 08/22/2016] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Next-generation sequencing technology has considerably changed the way we screen for pathogenic mutations in rare Mendelian disorders. However, identifying the disease-causing mutation among thousands of variants of partly unknown relevance is still challenging, and efficient techniques that reduce the genomic search space play a decisive role. Segregation or linkage analysis is often used to prioritize candidates; however, these approaches require correct information about the degree of relationship among the sequenced samples. For quality assurance, an automated control of pedigree structures and sample assignment is therefore highly desirable in order to detect label mix-ups that might otherwise corrupt downstream analysis.
RESULTS: We developed an algorithm based on likelihood ratios that discriminates between different classes of relationship for an arbitrary number of genotyped samples. By identifying the most likely class, we are able to reconstruct entire pedigrees iteratively, even for highly consanguineous families. We tested our approach on exome data from different sequencing studies and achieved high precision for all pedigree predictions. By analyzing the precision for varying degrees of relatedness or inbreeding, we could show that a prediction is robust down to magnitudes of a few hundred loci.
AVAILABILITY AND IMPLEMENTATION: A Java standalone application that computes the relationships between multiple samples, as well as an R script that visualizes the pedigree information, is available for download, along with a web service, at www.gene-talk.de.
CONTACT: heinrich@molgen.mpg.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Verena Heinrich
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin
  - Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin
- Stefan Mundlos
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin
  - Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin
- Peter N Robinson
  - Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin
- Peter M Krawitz
  - Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin
6
Staples J, Qiao D, Cho M, Silverman E, Nickerson D, Below J, Below JE. PRIMUS: rapid reconstruction of pedigrees from genome-wide estimates of identity by descent. Am J Hum Genet 2014; 95:553-64. [PMID: 25439724 DOI: 10.1016/j.ajhg.2014.10.005] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Received: 07/14/2014] [Accepted: 10/02/2014] [Indexed: 11/29/2022]
Abstract
Understanding and correctly utilizing relatedness among samples is essential for genetic analysis; however, managing sample records and pedigrees can often be error prone and incomplete. Data sets ascertained by random sampling often harbor cryptic relatedness that can be leveraged in genetic analyses for maximizing power. We have developed a method that uses genome-wide estimates of pairwise identity by descent to identify families and quickly reconstruct and score all possible pedigrees that fit the genetic data by using up to third-degree relatives, and we have included it in the software package PRIMUS (Pedigree Reconstruction and Identification of the Maximally Unrelated Set). Here, we validate its performance on simulated, clinical, and HapMap pedigrees. Among these samples, we demonstrate that PRIMUS can verify reported pedigree structures and identify cryptic relationships. Finally, we show that PRIMUS reconstructed pedigrees, all of which were previously unknown, for 203 families from a cohort collected in Starr County, TX (1,890 samples).
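To make the idea of classifying pairs "up to third-degree relatives" from pairwise IBD estimates concrete, here is a minimal sketch. It converts IBD1 and IBD2 proportions into a kinship coefficient and bins it with the powers-of-two cutoffs popularized by the KING software. Using these cutoffs is an assumption for illustration; PRIMUS's own classification procedure is not described in the abstract and may differ. The function names are hypothetical.

```python
from typing import Optional


def kinship_from_ibd(k1: float, k2: float) -> float:
    """Kinship coefficient from the proportions of the genome shared IBD1 and IBD2."""
    return k1 / 4 + k2 / 2


def relationship_degree(kinship: float) -> Optional[int]:
    """Return 0 (duplicate or MZ twin), 1, 2, or 3, or None if more distant.

    Cutoffs are 2^-1.5, 2^-2.5, 2^-3.5, and 2^-4.5 (KING-style, assumed here).
    """
    if kinship > 2 ** -1.5:
        return 0
    for degree in (1, 2, 3):
        if kinship > 2 ** -(degree + 1.5):
            return degree
    return None


# Expected IBD sharing for some standard relationships.
print(relationship_degree(kinship_from_ibd(k1=1.0, k2=0.0)))    # parent-child
print(relationship_degree(kinship_from_ibd(k1=0.5, k2=0.25)))   # full siblings
print(relationship_degree(kinship_from_ibd(k1=0.5, k2=0.0)))    # half siblings
print(relationship_degree(kinship_from_ibd(k1=0.25, k2=0.0)))   # first cousins
```

Parent-child and full-sibling pairs share the same kinship coefficient, so telling them apart requires looking at k1 and k2 separately rather than at kinship alone.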
Affiliation(s)
- Jennifer E Below
  - Epidemiology, Human Genetics, & Environmental Sciences, University of Texas Health Science Center, Houston, TX 77225, USA.
7
Shem-Tov D, Halperin E. Historical pedigree reconstruction from extant populations using PArtitioning of RElatives (PREPARE). PLoS Comput Biol 2014; 10:e1003610. [PMID: 24945698 PMCID: PMC4063675 DOI: 10.1371/journal.pcbi.1003610] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/30/2013] [Accepted: 03/13/2014] [Indexed: 11/18/2022]
Abstract
Recent technological improvements in the field of genetic data extraction give rise to the possibility of reconstructing the historical pedigrees of entire populations from the genotypes of individuals living today. Current methods are still not practical for real data scenarios, as they have limited accuracy and make unrealistic assumptions of monogamy and synchronized generations. In order to address these issues, we develop a new method for pedigree reconstruction, PREPARE, which is based on formulations of the pedigree reconstruction problem as variants of graph coloring. The new formulation allows us to consider features that were overlooked by previous methods, resulting in a reconstruction of up to 5 generations back in time, with an order of magnitude improvement in false-negative rates over the state of the art, while keeping a lower level of false-positive rates. We demonstrate the accuracy of PREPARE compared to previous approaches using simulation studies over a range of population sizes, including inbred and outbred populations, monogamous and polygamous mating patterns, as well as synchronous and asynchronous mating.

Learning the correct relationships between individuals from genetic data is a basic theoretical problem in the field of genetics, and has many practical consequences. A wide variety of statistical methods for genetic analysis assume the relationships between individuals are known, and can use relatedness information to improve inference. The current state-of-the-art methods for relationship inference consider pair-wise genetic similarity, and use it to infer the relationship between each pair of individuals. Reconstructing the pedigrees of an entire population directly has the potential to use more elaborate relationship information, and thus to obtain a better prediction of the familial relationships in the population. In contrast to the full set of pair-wise relationships in a population, genetic pedigrees provide a lossless and conflict-free structure for depicting the relationships between individuals. In an effort to make pedigree reconstruction practical we developed a new method, which is an order of magnitude more accurate than previous methods, and is the first method that has the ability to reconstruct polygamous pedigrees.
Affiliation(s)
- Doron Shem-Tov
  - The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
  - * E-mail:
- Eran Halperin
  - The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
  - International Computer Science Institute, Berkeley, California, United States of America
  - Molecular Microbiology and Biotechnology Department, Tel-Aviv University, Tel-Aviv, Israel
8
He D. IBD-Groupon: an efficient method for detecting group-wise identity-by-descent regions simultaneously in multiple individuals based on pairwise IBD relationships. Bioinformatics 2013; 29:i162-70. [PMID: 23812980 PMCID: PMC3694672 DOI: 10.1093/bioinformatics/btt237] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Detecting IBD tracts is an important problem in genetics. Most existing methods focus on detecting pairwise IBD tracts and have relatively low power to detect short IBD tracts. Methods that detect IBD tracts among multiple individuals simultaneously, or group-wise IBD tracts, perform better at detecting short IBD tracts. Group-wise IBD tracts can be applied to a wide range of problems, such as disease mapping and pedigree reconstruction. The existing group-wise IBD tract detection method is computationally inefficient and is only able to handle small datasets, such as 20 to 30 individuals with hundreds of SNPs. It also requires the number of IBD groups (partitions of the individuals in which all individuals in the same partition are IBD with each other) to be specified in advance, which may not be realistic in many cases. The method can only handle a small number of IBD groups, such as two or three, because of scalability issues. Moreover, it does not take linkage disequilibrium (LD) into consideration.
RESULTS: In this work, we developed an efficient method, IBD-Groupon, which detects group-wise IBD tracts based on pairwise IBD relationships and addresses all of the drawbacks mentioned above. To our knowledge, our method is the first practical group-wise IBD tract detection method that scales to very large datasets, for example, hundreds of individuals with thousands of SNPs, while retaining the power to detect short IBD tracts. Our method does not need the number of IBD groups to be specified; the groups are detected automatically. Our method also takes LD into consideration, as it is based on pairwise IBD tracts, where LD can be easily incorporated.
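As a simplified illustration of deriving group-wise IBD from pairwise IBD, the sketch below groups individuals at a single genomic position by taking connected components of the pairwise IBD graph with a union-find structure. This transitive-closure shortcut is an assumption made for illustration; it is not the IBD-Groupon algorithm, whose details are not given in the abstract. The segment tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical segment layout: (individual_a, individual_b, start_bp, end_bp).
Segment = Tuple[str, str, int, int]


def groupwise_ibd(
    individuals: Iterable[str], pairwise_segments: Iterable[Segment], position: int
) -> List[List[str]]:
    """Group individuals connected by pairwise IBD segments covering `position`."""
    parent: Dict[str, str] = {ind: ind for ind in individuals}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, start, end in pairwise_segments:
        if start <= position <= end:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[str, set] = defaultdict(set)
    for ind in parent:
        groups[find(ind)].add(ind)
    # Report only groups with at least two members, in a stable order.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


segments = [("A", "B", 100, 500), ("B", "C", 300, 800), ("C", "D", 900, 1200)]
print(groupwise_ibd(["A", "B", "C", "D"], segments, position=400))
print(groupwise_ibd(["A", "B", "C", "D"], segments, position=1000))
```

Because no number of groups is passed in, the groups emerge from the data, which mirrors the abstract's point that the number of IBD groups need not be specified beforehand.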
Affiliation(s)
- Dan He
  - Computational Genomics, IBM TJ Watson Research, Yorktown Heights, NY 10598, USA.