1
Nguyen R, Kapp JD, Sacco S, Myers SP, Green RE. A computational approach for positive genetic identification and relatedness detection from low-coverage shotgun sequencing data. J Hered 2023; 114:504-512. [PMID: 37381815 PMCID: PMC10445519 DOI: 10.1093/jhered/esad041] [Received: 03/08/2023] [Accepted: 06/28/2023] [Indexed: 06/30/2023]
Abstract
Several methods exist for detecting genetic relatedness or identity by comparing DNA information. These methods generally require genotype calls, either single-nucleotide polymorphisms or short tandem repeats, at the sites used for comparison. For some DNA samples, like those obtained from bone fragments or single rootless hairs, there is often not enough DNA present to generate genotype calls that are accurate and complete enough for these comparisons. Here, we describe IBDGem, a fast and robust computational procedure for detecting genomic regions of identity-by-descent by comparing low-coverage shotgun sequence data against genotype calls from a known query individual. At less than 1× genome coverage, IBDGem reliably detects segments of relatedness and can make high-confidence identity detections with as little as 0.01× genome coverage.
Affiliation(s)
- Remy Nguyen
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States
- Joshua D Kapp
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States
- Samuel Sacco
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States
- Steven P Myers
  - California Department of Justice Jan Bashinski DNA Laboratory, Richmond, CA, United States
- Richard E Green
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States
2
Huang M, Liu M, Li H, King J, Smuts A, Budowle B, Ge J. A machine learning approach for missing persons cases with high genotyping errors. Front Genet 2022; 13:971242. [PMID: 36263419 PMCID: PMC9573995 DOI: 10.3389/fgene.2022.971242] [Received: 06/16/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022]
Abstract
Estimating the relationships between individuals is one of the fundamental challenges in many fields. In particular, relationship estimation could provide valuable information for missing persons cases. The recently developed investigative genetic genealogy approach uses high-density single nucleotide polymorphisms (SNPs) to determine close and more distant relationships, in which hundreds of thousands to tens of millions of SNPs are generated either by microarray genotyping or whole-genome sequencing. Current studies usually assume the SNP profiles were generated with minimal errors. However, in missing persons cases, the DNA samples can be highly degraded, and the SNP profiles generated from these samples usually contain many errors. In this study, a machine learning approach was developed for estimating relationships from high-error SNP profiles. In this approach, a hierarchical classification strategy was employed, first to classify the relationships by degree and then to classify the relationship types within each degree separately. For each classification, feature selection was implemented to gain better performance. Both simulated and real data sets with various genotyping error rates were used to evaluate this approach, and it was more accurate and robust than the individual measures for SNP profiles with genotyping errors. In addition, the highest accuracy was obtained when the train and test sets had the same genotyping error rates, and thus estimating the genotyping error rate of the SNP profiles is critical to obtaining high accuracy in relationship estimation.
Affiliation(s)
- Meng Huang
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
- Muyi Liu
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
- Hongmin Li
  - Department of Computer Science, College of Science, California State University, East Bay, Hayward, CA, United States
- Jonathan King
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
- Amy Smuts
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
- Bruce Budowle
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX, United States
- Jianye Ge
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX, United States