1
Galván-Femenía I, Barceló-Vidal C, Sumoy L, Moreno V, de Cid R, Graffelman J. A likelihood ratio approach for identifying three-quarter siblings in genetic databases. Heredity (Edinb) 2021; 126:537-547. [PMID: 33452467 PMCID: PMC8027836 DOI: 10.1038/s41437-020-00392-8] [Received: 10/26/2020] [Revised: 11/04/2020] [Accepted: 11/16/2020] [Indexed: 11/09/2022]
Abstract
The detection of family relationships in genetic databases is of interest in various scientific disciplines such as genetic epidemiology, population and conservation genetics, forensic science, and genealogical research. Nowadays, screening genetic databases for related individuals forms an important aspect of standard quality control procedures. Relatedness research is usually based on an allele sharing analysis of identity by state (IBS) or identity by descent (IBD) alleles. Existing IBS/IBD methods mainly aim to identify first-degree (parent-offspring or full sibling) and second-degree (half-sibling, avuncular, or grandparent-grandchild) pairs. Little attention has been paid to detecting relationships that fall between the first and second degree, such as three-quarter siblings (3/4S), who share fewer alleles than first-degree relatives but more alleles than second-degree relatives. With the progressively increasing sample sizes used in genetic research, it becomes more likely that such relationships are present in the database under study. In this paper, we extend existing likelihood ratio (LR) methodology to accurately infer the existence of 3/4S, distinguishing them from full siblings and second-degree relatives. We use bootstrap confidence intervals to express uncertainty in the LRs. Our proposal accounts for linkage disequilibrium (LD) by using marker pruning, and we validate our methodology with a pedigree-based simulation study accounting for both LD and recombination. An empirical genome-wide array data set from the GCAT Genomes for Life cohort project is used to illustrate the method.
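As a rough illustration of the likelihood ratio idea described in this abstract, the Python sketch below compares the three-quarter sibling hypothesis with an alternative relationship for one pair of individuals. It is not the authors' implementation: it assumes independent (already pruned) biallelic markers, Hardy-Weinberg proportions, known allele frequencies and no genotyping error, and it leaves out the bootstrap confidence intervals. The function names and the genotype coding (0, 1 or 2 copies of a chosen allele) are hypothetical choices made for this example; the IBD probabilities (k0, k1, k2) are the standard pedigree expectations for each relationship.

```python
from itertools import product
from math import log

# Expected probabilities (k0, k1, k2) of sharing 0, 1 or 2 alleles IBD.
# Parent-offspring is left out on purpose: with k0 = 0 some genotype pairs
# have probability zero and the log-likelihood below would be undefined.
K = {
    "FS": (0.25, 0.5, 0.25),      # full siblings
    "3/4S": (0.375, 0.5, 0.125),  # three-quarter siblings
    "2nd": (0.5, 0.5, 0.0),       # half-siblings, avuncular, grandparent-grandchild
    "UN": (1.0, 0.0, 0.0),        # unrelated
}


def hwe(g, p):
    """Hardy-Weinberg probability of genotype g (copies of the allele with frequency p)."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def pair_prob_given_ibd(g1, g2, p):
    """P(g1, g2 | IBD = 0), P(g1, g2 | IBD = 1), P(g1, g2 | IBD = 2) at one marker."""
    freq = {1: p, 0: 1.0 - p}
    p0 = hwe(g1, p) * hwe(g2, p)
    # IBD = 1: one shared allele s, plus one independent allele per individual.
    p1 = sum(
        freq[s] * freq[a] * freq[b]
        for s, a, b in product((0, 1), repeat=3)
        if s + a == g1 and s + b == g2
    )
    p2 = hwe(g1, p) if g1 == g2 else 0.0
    return p0, p1, p2


def pair_loglik(genos1, genos2, freqs, rel):
    """Log-likelihood of a relationship, summing over independent markers."""
    k = K[rel]
    return sum(
        log(sum(ki * pi for ki, pi in zip(k, pair_prob_given_ibd(g1, g2, p))))
        for g1, g2, p in zip(genos1, genos2, freqs)
    )


def log_lr(genos1, genos2, freqs, rel_num, rel_den):
    """Log likelihood ratio of rel_num versus rel_den for one pair."""
    return pair_loglik(genos1, genos2, freqs, rel_num) - pair_loglik(
        genos1, genos2, freqs, rel_den
    )


if __name__ == "__main__":
    # Toy pair typed at three markers (invented genotypes and frequencies).
    g1, g2, freqs = [0, 1, 2], [2, 1, 2], [0.3, 0.5, 0.2]
    print(log_lr(g1, g2, freqs, "3/4S", "FS"))
```

A positive log ratio favours the numerator relationship. In practice many thousands of markers would be needed, since the expected IBD probabilities of 3/4S lie close to those of full siblings and of second-degree relatives.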
Affiliation(s)
- Iván Galván-Femenía
  - Department of Computer Science, Applied Mathematics and Statistics, Universitat de Girona, Girona, Spain
  - Genomes For Life - GCAT lab, Institute for Health Science Research Germans Trias i Pujol (IGTP), Can Ruti Campus, Badalona, Barcelona, Spain
- Carles Barceló-Vidal
  - Department of Computer Science, Applied Mathematics and Statistics, Universitat de Girona, Girona, Spain
- Lauro Sumoy
  - High Content Genomics and Bioinformatics Unit, Institute for Health Science Research Germans Trias i Pujol (IGTP), Can Ruti Campus, Badalona, Barcelona, Spain
- Victor Moreno
  - Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Badalona, Spain
  - ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), Barcelona, Spain
  - Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, Spain
  - Department of Clinical Sciences, University of Barcelona, Barcelona, Spain
- Rafael de Cid
  - Genomes For Life - GCAT lab, Institute for Health Science Research Germans Trias i Pujol (IGTP), Can Ruti Campus, Badalona, Barcelona, Spain
- Jan Graffelman
  - Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Spain
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
2
Blay N, Casas E, Galván-Femenía I, Graffelman J, de Cid R, Vavouri T. Assessment of kinship detection using RNA-seq data. Nucleic Acids Res 2020; 47:e136. [PMID: 31501877 PMCID: PMC6868348 DOI: 10.1093/nar/gkz776] [Received: 02/16/2019] [Revised: 08/23/2019] [Accepted: 08/29/2019] [Indexed: 01/23/2023]
Abstract
Analysis of RNA sequencing (RNA-seq) data from related individuals is widely used in clinical and molecular genetics studies. Prediction of kinship from RNA-seq data would be useful for confirming the expected relationships in family-based studies and for highlighting samples from related individuals in case-control or population-based studies. Currently, reconstruction of pedigrees is largely based on SNPs or microsatellites obtained from genotyping arrays, whole-genome sequencing and whole-exome sequencing. Potential problems with using RNA-seq data for kinship detection are the low proportion of the genome that it covers, the highly skewed coverage of exons of different genes depending on expression level, and allele-specific expression. In this study we assess the use of RNA-seq data to detect kinship between individuals through pairwise identity by descent (IBD) estimates. First, we obtained high-quality SNPs after successive filters to minimize the effects of allelic imbalance as well as errors in sequencing, mapping and genotyping. Then, we used these SNPs to calculate pairwise IBD estimates. By analysing both real and simulated RNA-seq data, we show that it is possible to identify relationships up to the second degree using RNA-seq data of even low to moderate sequencing depth.
Affiliation(s)
- Natalia Blay
  - Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain
  - Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain
  - Masters Programme in Bioinformatics and Biostatistics, Universitat Oberta de Catalunya (UOC), Barcelona 08035, Spain
- Eduard Casas
  - Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain
  - Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain
  - Doctoral Programme in Biomedicine, Universitat de Barcelona, Barcelona 08007, Spain
- Iván Galván-Femenía
  - Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain
  - Genomes for Life - GCAT lab Group - Germans Trias i Pujol Research Institute, Can Ruti Campus, Ctra de Can Ruti, Camí de les Escoles s/n, Badalona, Barcelona 08916, Spain
- Jan Graffelman
  - Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona 08028, Spain
  - Department of Biostatistics, University of Washington, Seattle, WA 98105-946, USA
- Rafael de Cid
  - Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain
  - Genomes for Life - GCAT lab Group - Germans Trias i Pujol Research Institute, Can Ruti Campus, Ctra de Can Ruti, Camí de les Escoles s/n, Badalona, Barcelona 08916, Spain
- Tanya Vavouri
  - Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Badalona 08916, Spain
  - Josep Carreras Leukaemia Research Institute (IJC), Campus ICO-Germans Trias i Pujol, Universitat Autònoma de Barcelona, Badalona 08916, Spain
3
Graffelman J, Galván Femenía I, de Cid R, Barceló Vidal C. A Log-Ratio Biplot Approach for Exploring Genetic Relatedness Based on Identity by State. Front Genet 2019; 10:341. [PMID: 31068965 PMCID: PMC6491861 DOI: 10.3389/fgene.2019.00341] [Received: 12/05/2018] [Accepted: 03/29/2019] [Indexed: 12/31/2022]
Abstract
The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores the fact that allele sharing data across individuals in reality has a higher dimensionality, and it also disregards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are especially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a magnifying glass for more remote relationships that are harder to classify. Datasets from the 1000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population, the classification rate of the log-ratio principal component approach is higher than that of the classical graphics across the whole allele frequency spectrum, using only identity by state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.
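The log-ratio step mentioned in this abstract can be sketched in a few lines of Python. The example below applies a centred log-ratio (clr) transform to counts of shared genotypes for each pair of individuals and then column-centres the result, which is the matrix a principal component analysis (for instance via a singular value decomposition) would start from. This is only a minimal sketch under stated assumptions: the six count categories, the invented numbers and the function names are hypothetical, the biplot itself is not computed, and zero counts would need some form of replacement before taking logarithms.

```python
from math import log


def clr(parts):
    """Centred log-ratio transform of one composition (all parts must be > 0)."""
    logs = [log(x) for x in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def clr_matrix(rows):
    """clr-transform every row, then centre each column (input for a PCA)."""
    z = [clr(r) for r in rows]
    n = len(z)
    col_means = [sum(col) / n for col in zip(*z)]
    return [[v - m for v, m in zip(row, col_means)] for row in z]


if __name__ == "__main__":
    # Invented counts of six shared-genotype categories for three pairs.
    counts = [
        [5200, 300, 10, 2900, 250, 1340],
        [4100, 900, 400, 2500, 800, 1300],
        [4600, 600, 150, 2700, 500, 1450],
    ]
    for row in clr_matrix(counts):
        print([round(v, 3) for v in row])
```

Because the clr transform only uses ratios between parts, it gives the same result for raw counts and for proportions, which is one reason it suits compositional data of this kind.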
Affiliation(s)
- Jan Graffelman
  - Department of Statistics and Operations Research, Technical University of Catalonia, Barcelona, Spain
  - Department of Biostatistics, University of Washington, Seattle, WA, United States
- Iván Galván Femenía
  - Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain
  - Genomes For Life - GCAT Lab, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona, Spain
- Rafael de Cid
  - Genomes For Life - GCAT Lab, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona, Spain
- Carles Barceló Vidal
  - Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain
4
Galván-Femenía I, Graffelman J, Barceló-I-Vidal C. Graphics for relatedness research. Mol Ecol Resour 2017; 17:1271-1282. [PMID: 28374569 PMCID: PMC5624821 DOI: 10.1111/1755-0998.12674] [Received: 05/27/2016] [Revised: 03/15/2017] [Accepted: 03/21/2017] [Indexed: 11/27/2022]
Abstract
Studies of relatedness have been crucial in molecular ecology over recent decades: studies of population structure, evolution of social behaviours, genetic diversity and quantitative genetics all involve relatedness research. The main aim of this article was to review the most common graphical methods used in allele sharing studies for detecting and identifying family relationships. Both IBS- and IBD-based allele sharing studies are considered. Furthermore, we propose two additional graphical methods from the field of compositional data analysis: the ternary diagram and scatterplots of isometric log-ratios of IBS and IBD probabilities. We illustrate all graphical tools with genetic data from the HGDP-CEPH diversity panel, using mainly 377 microsatellites genotyped for 25 individuals from the Maya population of this panel. We enhance all graphics with convex hulls obtained by simulation and use these to confirm the documented relationships. The proposed compositional graphics are shown to be useful in relatedness research, as they also single out the most prominent related pairs. The ternary diagram is advocated for its ability to display all three allele sharing probabilities simultaneously. The log-ratio plots are advocated as an attempt to overcome the problems with the Euclidean distance interpretation in the classical graphics.
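To make the isometric log-ratio idea in this abstract concrete, the Python sketch below maps a vector of three allele sharing probabilities (k0, k1, k2) to two coordinates that could be drawn in an ordinary scatterplot. It is an illustration rather than the article's own code: the function name is hypothetical, and the particular pair of balances used here is one common choice of basis, assumed for the example, since other orderings of the parts give rotated but equivalent coordinates. All three probabilities must be strictly positive, so relationships with an expected k2 of zero would need zero replacement first.

```python
from math import log, sqrt


def ilr3(k0, k1, k2):
    """Two isometric log-ratio coordinates of a strictly positive 3-part composition."""
    z1 = log(k0 / k1) / sqrt(2)           # balance of k0 against k1
    z2 = log(k0 * k1 / k2 ** 2) / sqrt(6)  # balance of (k0, k1) against k2
    return z1, z2


if __name__ == "__main__":
    # Expected IBD probabilities for two relationships with all parts > 0.
    expected = {
        "full siblings": (0.25, 0.5, 0.25),
        "three-quarter siblings": (0.375, 0.5, 0.125),
    }
    for name, k in expected.items():
        print(name, ilr3(*k))
```

Euclidean distances between points in these coordinates correspond to distances between the original compositions measured on the log-ratio scale, which is the property that motivates plotting them instead of the raw probabilities.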
Affiliation(s)
- Iván Galván-Femenía
  - Department of Computer Science, Applied Mathematics and Statistics, Universitat de Girona, Girona, Spain
  - Disease Genomics-GCAT Group, Germans Trias Health Research Institute (IGTP)-Program of Predictive and Personalized Medicine of Cancer (PMPPC), Can Ruti Campus, Badalona, Barcelona, Spain
- Jan Graffelman
  - Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Spain
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Carles Barceló-I-Vidal
  - Department of Computer Science, Applied Mathematics and Statistics, Universitat de Girona, Girona, Spain