1. Barash M, McNevin D, Fedorenko V, Giverts P. Machine learning applications in forensic DNA profiling: A critical review. Forensic Sci Int Genet 2024; 69:102994. [PMID: 38086200 DOI: 10.1016/j.fsigen.2023.102994] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/29/2023] [Revised: 11/06/2023] [Accepted: 11/26/2023] [Indexed: 01/29/2024]
Abstract
Machine learning (ML) is a range of powerful computational algorithms capable of generating predictive models via intelligent autonomous analysis of relatively large and often unstructured data. ML has become an integral part of our daily lives with a plethora of applications, including web, business, automotive industry, clinical diagnostics, scientific research, and more recently, forensic science. In the field of forensic DNA, the manual analysis of complex data can be challenging, time-consuming, and error-prone. The integration of novel ML-based methods may aid in streamlining this process while maintaining the high accuracy and reproducibility required for forensic tools. Due to the relative novelty of such applications, the forensic community is largely unaware of ML capabilities and limitations. Furthermore, computer science and ML professionals are often unfamiliar with the forensic science field and its specific requirements. This manuscript offers a brief introduction to the capabilities of machine learning methods and their applications in the context of forensic DNA analysis and offers a critical review of the current literature in this rapidly developing field.
Affiliation(s)
- Mark Barash: Department of Justice Studies, San José State University, San Jose, CA, United States; Centre for Forensic Science, School of Mathematical and Physical Sciences, Faculty of Science, University of Technology Sydney, Broadway, Ultimo, NSW 2007, Australia
- Dennis McNevin: Centre for Forensic Science, School of Mathematical and Physical Sciences, Faculty of Science, University of Technology Sydney, Broadway, Ultimo, NSW 2007, Australia
- Vladimir Fedorenko: The Educational and Scientific Laboratory of Forensic Materials Engineering of the Saratov State University, Russia
- Pavel Giverts: Division of Identification and Forensic Science, Israel Police HQ, Haim Bar-Lev Road, Jerusalem, Israel
2. Wang H, Zhu Q, Huang Y, Cao Y, Hu Y, Wei Y, Wang Y, Hou T, Shan T, Dai X, Zhang X, Wang Y, Zhang J. Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers. Forensic Sci Int Genet 2024; 69:103008. [PMID: 38244524 DOI: 10.1016/j.fsigen.2024.103008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 12/01/2023] [Accepted: 01/05/2024] [Indexed: 01/22/2024]
Abstract
Inferring the number of contributors (NoC) is a crucial step in interpreting DNA mixtures, as it directly affects the accuracy of the likelihood ratio calculation and the assessment of evidence strength. However, obtaining the correct NoC in complex DNA mixtures remains challenging due to the high degree of allele sharing and dropout. This study aimed to analyze the impact of allele sharing and dropout on NoC inference in complex DNA mixtures when using microhaplotypes (MH). The effectiveness and value of highly polymorphic MH for NoC inference in complex DNA mixtures were evaluated by comparing the performance of three NoC inference methods: the maximum allele count (MAC) method, the maximum likelihood estimation (MLE) method, and the random forest classification (RFC) algorithm. In this study, we selected the top 100 most polymorphic MH from the Southern Han Chinese (CHS) population, and simulated over 40 million complex DNA mixture profiles with the NoC ranging from 2 to 8. These profiles involve unrelated individuals (RM type) and related pairs of individuals, including parent-offspring pairs (PO type), full-sibling pairs (FS type), and second-degree kinship pairs (SE type). Our results showed how the number of detected alleles in DNA mixture profiles varied with the markers' polymorphism, the involvement of kinship, the NoC, and the dropout settings. Across different types of DNA mixtures, the MAC and MLE methods performed best in the RM type, followed by SE, FS, and PO types, while RFC models showed the best performance in the PO type, followed by RM, SE, and FS types. The recall of all three methods for NoC inference decreased as the NoC and dropout levels increased. Furthermore, the MLE method performed better at low NoC, whereas RFC models excelled at high NoC and/or high dropout levels, regardless of the availability of a priori information about related pairs of individuals in DNA mixtures. However, the RFC models that considered the aforementioned a priori information and were trained specifically on each type of DNA mixture profile outperformed the RFC_ALL model, which did not consider such information. Finally, we provided recommendations for model building when applying machine learning algorithms to NoC inference.
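As a rough illustration of the simplest of the three methods named in this abstract, the maximum allele count (MAC) rule can be sketched as follows. This is a minimal sketch assuming diploid contributors, so each person adds at most two alleles per marker; the function name `mac_min_contributors`, the marker labels, and the allele labels are hypothetical and are not taken from the paper. The rule gives only a lower bound, which is why allele sharing and dropout make it underestimate the NoC.

```python
from math import ceil


def mac_min_contributors(profile: dict) -> int:
    """Return the MAC lower bound on the number of contributors.

    `profile` maps a marker name to the set of distinct alleles detected
    at that marker. Assumes diploid contributors (at most two alleles
    per person per marker).
    """
    if not profile:
        raise ValueError("profile has no markers")
    # The marker with the most distinct alleles drives the estimate.
    max_alleles = max(len(alleles) for alleles in profile.values())
    # At least ceil(max_alleles / 2) people are needed to explain them.
    return max(1, ceil(max_alleles / 2))


# Hypothetical three-marker mixture profile (illustrative labels only).
example_profile = {
    "MH01": {"A", "B", "C"},
    "MH02": {"A", "B", "C", "D", "E"},
    "MH03": {"A", "B"},
}
estimate = mac_min_contributors(example_profile)
```

Here the five alleles at the second marker cannot be explained by fewer than three diploid contributors, so the bound is three even if the true NoC is higher.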
Affiliation(s)
- Haoyu Wang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Qiang Zhu: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yuguo Huang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yueyan Cao: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yuhan Hu: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yifan Wei: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yuting Wang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Tingyun Hou: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Tiantian Shan: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Xuan Dai: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Xiaokang Zhang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Yufang Wang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
- Ji Zhang: West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, China
3. Adamowicz MS, Rambo TN, Clarke JL. Internal Validation of MaSTR™ Probabilistic Genotyping Software for the Interpretation of 2–5 Person Mixed DNA Profiles. Genes (Basel) 2022; 13:genes13081429. [PMID: 36011340 PMCID: PMC9408203 DOI: 10.3390/genes13081429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 08/07/2022] [Accepted: 08/08/2022] [Indexed: 11/16/2022] [Open Access]
Abstract
Mixed human deoxyribonucleic acid (DNA) samples present one of the most challenging pieces of evidence that a forensic analyst can encounter. When multiple contributors, stochastic amplification, and allele drop-out further complicate the mixture profile, interpretation by hand becomes unreliable and statistical analysis problematic. Probabilistic genotyping software has provided a tool to address complex mixture interpretation and provide likelihood ratios for defined sets of propositions. The MaSTR™ software is a fully continuous probabilistic system that considers a wide range of STR profile data to provide likelihood ratios on DNA mixtures. Mixtures with two to five contributors and a range of component ratios and allele peak heights were created to test the validity of MaSTR™ with data similar to real casework. Over 280 different mixed DNA profiles were used to perform more than 2600 analyses using different sets of propositions and numbers of contributors. The results of the analyses demonstrated that MaSTR™ provided accurate and precise statistical data on DNA mixtures with up to five contributors, including minor contributors with stochastic amplification effects. Tests for both Type I and Type II errors were performed. The findings in this study support that MaSTR™ is a robust tool that meets the current standards for probabilistic genotyping.
4. Noël J, Noël S, Mailly F, Granger D, Lefebvre JF, Milot E, Séguin D. Total allele count distribution (TAC curves) improves number of contributor estimation for complex DNA mixtures. Canadian Society of Forensic Science Journal 2022. [DOI: 10.1080/00085030.2022.2028359] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Affiliation(s)
- Josée Noël: Laboratoire de Sciences Judiciaires et de Médecine Légale, Montréal, Québec, Canada
- Sarah Noël: Laboratoire de Sciences Judiciaires et de Médecine Légale, Montréal, Québec, Canada
- France Mailly: Laboratoire de Sciences Judiciaires et de Médecine Légale, Montréal, Québec, Canada
- Dominic Granger: Laboratoire de Sciences Judiciaires et de Médecine Légale, Montréal, Québec, Canada
- Emmanuel Milot: Laboratoire de Recherche en Criminalistique, Department of Chemistry, Biochemistry and Physics and Centre International de Criminologie Comparée, Université du Québec à Trois-Rivières, Trois-Rivières, Québec, Canada
- Diane Séguin: Laboratoire de Sciences Judiciaires et de Médecine Légale, Montréal, Québec, Canada
5. Grgicak CM, Duffy KR, Lun DS. The a posteriori probability of the number of contributors when conditioned on an assumed contributor. Forensic Sci Int Genet 2021; 54:102563. [PMID: 34284325 DOI: 10.1016/j.fsigen.2021.102563] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Revised: 06/24/2021] [Accepted: 07/03/2021] [Indexed: 10/20/2022]
Abstract
Forensic DNA signal is notoriously challenging to assess, requiring computational tools to support its interpretation. Over-expressions of stutter, allele drop-out, allele drop-in, degradation, differential degradation, and the like, make forensic DNA profiles too complicated to evaluate by manual methods. In response, computational tools that make point estimates on the Number of Contributors (NOC) to a sample have been developed, as have Bayesian methods that evaluate an A Posteriori Probability (APP) distribution on the NOC. In cases where an overly narrow NOC range is assumed, the downstream strength of evidence may be incomplete insofar as the evidence is evaluated with an inadequate set of propositions. In the current paper, we extend previous work on NOCIt, a Bayesian method that determines an APP on the NOC given an electropherogram, by reporting on an implementation where the user can add assumed contributors. NOCIt is a continuous system that incorporates models of peak height (including degradation and differential degradation), forward and reverse stutter, noise, and allelic drop-out, while being cognizant of allele frequencies in a reference population. When conditioned on a known contributor, we found that the mode of the APP distribution can shift to one greater when compared with the circumstance where no known contributor is assumed, and that occurred most often when the assumed contributor was the minor constituent to the mixture. In a development of a result of Slooten and Caliebe (FSI:G, 2018) that, under suitable assumptions, establishes the NOC can be treated as a nuisance variable in the computation of a likelihood ratio between the prosecution and defense hypotheses, we show that this computation must not only use coincident models, but also coincident contextual information. The results reported here, therefore, illustrate the power of modern probabilistic systems to assess full weights-of-evidence, and to provide information on reasonable NOC ranges across multiple contexts.
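For orientation, the two quantities this abstract refers to can be written in generic form. This is a sketch in assumed notation, not the notation or derivation of the paper: $E$ is the electropherogram, $n$ the number of contributors, $H_p$ and $H_d$ the prosecution and defense hypotheses, and $I$ the contextual information (for example an assumed contributor). The first line is Bayes' rule for the APP; the second applies the law of total probability to marginalize over $n$, with the same $I$ appearing in numerator and denominator.

```latex
\begin{align}
  \Pr(n \mid E, I) &= \frac{\Pr(E \mid n, I)\,\Pr(n \mid I)}
                          {\sum_{m} \Pr(E \mid m, I)\,\Pr(m \mid I)} \\[4pt]
  \mathrm{LR} &= \frac{\sum_{n} \Pr(E \mid H_p, n, I)\,\Pr(n \mid H_p, I)}
                      {\sum_{n} \Pr(E \mid H_d, n, I)\,\Pr(n \mid H_d, I)}
\end{align}
```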
Affiliation(s)
- Catherine M Grgicak: Department of Chemistry, Rutgers University, Camden, NJ 08102, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
- Ken R Duffy: Hamilton Institute, Maynooth University, Ireland
- Desmond S Lun: Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA; Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; Department of Plant Biology, Rutgers University, New Brunswick, NJ 08901, USA
6. Valtl J, Mönich UJ, Lun DS, Kelley J, Grgicak CM. A series of developmental validation tests for Number of Contributors platforms: Exemplars using NOCIt and a neural network. Forensic Sci Int Genet 2021; 54:102556. [PMID: 34225042 DOI: 10.1016/j.fsigen.2021.102556] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2020] [Revised: 06/15/2021] [Accepted: 06/16/2021] [Indexed: 10/21/2022]
Abstract
Complex DNA mixtures are challenging to interpret and require computational tools that aid in that interpretation. Recently, several computational methods that estimate the number of contributors (NOC) to a sample have been developed. Unlike analogous tools that interpret profiles and report LRs, NOC tools vary widely in their operational principle: some are Bayesian and others are machine learning tools. Moreover, NOC tools may return a single n estimate, or a distribution on n. This vast array of constructs, coupled with a gap in standardized methods by which to validate NOC systems, warrants an exploration into the measures by which differing NOC systems might be tested for operations. In the current paper, we use two exemplar NOC systems: a probabilistic system named NOCIt, which renders an a posteriori probability (APP) distribution on the number of contributors given an electropherogram, and an artificial neural network (ANN). NOCIt is a continuous Bayesian inference system incorporating models of peak height, degradation, differential degradation, forward and reverse stutter, noise and allelic drop-out while considering allele frequencies in a reference population. The ANN is also a continuous method, taking all the same features (barring degradation) into account. Unlike its Bayesian counterpart, it demands substantively more data to parameterize, requiring synthetic data. We explore each system's performance by conducting tests on 214 PROVEDIt mixtures where the limit of detection was one copy of DNA. We found that after a lengthy training period of approximately 24 h, the ANN's evaluation process was very fast and perfectly repeatable. In contrast, NOCIt only took a few minutes to train but took tens of minutes to complete each sample and was less repeatable. In addition, NOCIt rendered a probability distribution that was more sensitive and specific, affording a reasonable method by which to report all reasonable n that explain the evidence for a given sample. Whatever the method, by acknowledging the inherent differences between NOC systems, we demonstrate that validation constructs will necessarily be guided by the needs of the forensic domain and be dependent upon whether the laboratory seeks to assign a single n or range of n.
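The distinction drawn in this abstract between assigning a single n and reporting a range of n can be made concrete with a small sketch. It is not NOCIt's algorithm or the ANN from the paper: it assumes that some upstream model has already produced a log-likelihood of the evidence for each candidate n, and the function names `app_distribution` and `credible_set`, the uniform default prior, the 0.99 mass threshold, and the numbers are all hypothetical choices for illustration.

```python
from math import exp
from typing import Optional


def app_distribution(log_likelihoods: dict, prior: Optional[dict] = None) -> dict:
    """Turn per-n log-likelihoods into an a posteriori probability on n."""
    ns = sorted(log_likelihoods)
    if prior is None:
        # Assumption: uniform prior over the candidate n values.
        prior = {n: 1.0 / len(ns) for n in ns}
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    peak = max(log_likelihoods.values())
    weights = {n: exp(log_likelihoods[n] - peak) * prior[n] for n in ns}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def credible_set(app: dict, mass: float = 0.99) -> list:
    """Smallest set of n values whose posterior probabilities sum to at least `mass`."""
    chosen = []
    cumulative = 0.0
    for n, p in sorted(app.items(), key=lambda item: item[1], reverse=True):
        chosen.append(n)
        cumulative += p
        if cumulative >= mass:
            break
    return sorted(chosen)


# Hypothetical log-likelihoods of one profile under n = 1..4 contributors.
example_log_likelihoods = {1: -120.0, 2: -100.0, 3: -101.0, 4: -106.0}
app = app_distribution(example_log_likelihoods)
point_estimate = max(app, key=app.get)      # single n: the mode of the APP
reported_range = credible_set(app, 0.99)    # range of n covering 99% of the mass
```

A laboratory that wants a single n would report the mode, while one that wants every reasonable n would report the credible set; the validation tests appropriate to each choice differ, which is the point the abstract makes.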
Affiliation(s)
- Jakob Valtl: Lehrstuhl für Theoretische Informationstechnik, Technische Universität München, 80333 Munich, Germany
- Ullrich J Mönich: Lehrstuhl für Theoretische Informationstechnik, Technische Universität München, 80333 Munich, Germany
- Desmond S Lun: Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA; Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; Department of Plant Biology, Rutgers University, New Brunswick, NJ 08901, USA
- James Kelley: Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
- Catherine M Grgicak: Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA; Department of Chemistry, Rutgers University, Camden, NJ 08102, USA
7. Liu Z, Gao Z, Wang J, Shi J, Liu J, Chen D, Li W, Guo J, Cheng X, Hao T, Li Z, Li Y, Yan J, Zhang G. A method of identifying the blood contributor in mixture stains through detecting blood‐specific mRNA polymorphism. Electrophoresis 2020; 41:1364-1373. [DOI: 10.1002/elps.202000053] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/15/2020] [Revised: 03/31/2020] [Accepted: 05/04/2020] [Indexed: 01/31/2023]
Affiliation(s)
- Zidong Liu: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Zhe Gao: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Jiaqi Wang: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Jie Shi: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Jinding Liu: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Deqing Chen: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Wenyan Li: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Jiangling Guo: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Xiaojuan Cheng: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Ting Hao: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Zeqin Li: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Yanhua Li: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Jiangwei Yan: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China
- Gengqian Zhang: School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi, P. R. China