1
Elhaik E. Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep 2022; 12:14683. [PMID: 36038559 PMCID: PMC9424212 DOI: 10.1038/s41598-022-14395-4] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 01/14/2022] [Accepted: 06/06/2022] [Indexed: 12/29/2022]
Abstract
Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.
Affiliation(s)
- Eran Elhaik
  - Department of Biology, Lund University, 22362 Lund, Sweden.
2
Behnamian S, Esposito U, Holland G, Alshehab G, Dobre AM, Pirooznia M, Brimacombe CS, Elhaik E. Temporal population structure, a genetic dating method for ancient Eurasian genomes from the past 10,000 years. Cell Reports Methods 2022; 2:100270. [PMID: 36046618 PMCID: PMC9421539 DOI: 10.1016/j.crmeth.2022.100270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2021] [Revised: 06/17/2022] [Accepted: 07/19/2022] [Indexed: 11/21/2022]
Abstract
Radiocarbon dating is the gold standard in archeology to estimate the age of skeletons, a key to studying their origins. Many published ancient genomes lack reliable and direct dates, which results in obscure and contradictory reports. We developed the temporal population structure (TPS), a DNA-based dating method for genomes ranging from the Late Mesolithic to today, and applied it to 3,591 ancient and 1,307 modern Eurasians. TPS predictions aligned with the known dates and correctly accounted for kin relationships. TPS dating of poorly dated Eurasian samples resolved conflicting reports in the literature, as illustrated by one test case. We also demonstrated how TPS improved the ability to study phenotypic traits over time. TPS can be used when radiocarbon dating is unfeasible or uncertain or to develop alternative hypotheses for samples younger than 10,000 years ago, a limitation that may be resolved over time as ancient data accumulate.
Affiliation(s)
- Sara Behnamian
  - Department of Biology, Lund University, 22362 Lund, Sweden
- Umberto Esposito
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
- Grace Holland
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
- Ghadeer Alshehab
  - Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield S1 3JD, UK
- Ann M. Dobre
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
- Mehdi Pirooznia
  - National Heart, Lung, and Blood Institute (NHLBI), Bethesda, MD 20892, USA
- Conrad S. Brimacombe
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
  - Department of Anthropology and Archaeology, University of Bristol, Bristol BS8 1TH, UK
- Eran Elhaik
  - Department of Biology, Lund University, 22362 Lund, Sweden
3
Aguiar-Pulido V, Wolujewicz P, Martinez-Fundichely A, Elhaik E, Thareja G, Abdel Aleem A, Chalhoub N, Cuykendall T, Al-Zamer J, Lei Y, El-Bashir H, Musser JM, Al-Kaabi A, Shaw GM, Khurana E, Suhre K, Mason CE, Elemento O, Finnell RH, Ross ME. Systems biology analysis of human genomes points to key pathways conferring spina bifida risk. Proc Natl Acad Sci U S A 2021; 118:e2106844118. [PMID: 34916285 PMCID: PMC8713748 DOI: 10.1073/pnas.2106844118] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Accepted: 10/20/2021] [Indexed: 12/15/2022]
Abstract
Spina bifida (SB) is a debilitating birth defect caused by multiple gene and environment interactions. Though SB shows non-Mendelian inheritance, genetic factors contribute to an estimated 70% of cases. Nevertheless, identifying human mutations conferring SB risk is challenging due to its relative rarity, genetic heterogeneity, incomplete penetrance, and environmental influences that hamper genome-wide association study approaches to untargeted discovery. Thus, SB genetic studies may suffer from population substructure and/or selection bias introduced by typical candidate gene searches. We report a population-based, ancestry-matched whole-genome sequence analysis of SB genetic predisposition using a systems biology strategy to interrogate 298 case-control subject genomes (149 pairs). Genes that were enriched in likely gene-disrupting (LGD), rare protein-coding variants were subjected to machine learning analysis to identify genes in which LGD variants occur with a different frequency in cases versus controls and so discriminate between these groups. Those genes with high discriminatory potential for SB significantly enriched pathways pertaining to carbon metabolism, inflammation, innate immunity, cytoskeletal regulation, and essential transcriptional regulation, consistent with their having an impact on the pathogenesis of human SB. Additionally, an interrogation of conserved noncoding sequences identified robust variant enrichment in regulatory regions of several transcription factors critical to embryonic development. This genome-wide perspective offers an effective approach to the interrogation of coding and noncoding sequence variant contributions to rare complex genetic disorders.
Affiliation(s)
- Vanessa Aguiar-Pulido
  - Center for Neurogenetics, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021
- Paul Wolujewicz
  - Center for Neurogenetics, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021
- Alexander Martinez-Fundichely
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065
  - His Royal Highness Prince Alwaleed Bin Talal Bin Abdulaziz Al-Saud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065
- Eran Elhaik
  - Department of Biology, Lund University, SE-221 00 Lund, Sweden
- Gaurav Thareja
  - Department of Physiology and Biophysics, Weill Cornell Medicine-Qatar, Doha, Qatar
- Nader Chalhoub
  - Department of Neurology, Weill Cornell Medicine-Qatar, Doha, Qatar
- Tawny Cuykendall
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065
  - His Royal Highness Prince Alwaleed Bin Talal Bin Abdulaziz Al-Saud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065
- Jamel Al-Zamer
  - Rehabilitation Medicine, Hamad Medical Corporation, Doha, Qatar
- Yunping Lei
  - Department of Molecular and Cellular Biology, Center for Precision Environmental Health, Baylor College of Medicine, Houston, TX 77030
- James M Musser
  - Department of Pathology and Genomic Medicine, Houston Methodist Research Institute, Houston, TX 77030
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065
- Abdulla Al-Kaabi
  - Sidra Medical and Research Center, Weill Cornell Medicine-Qatar, Doha, Qatar
- Gary M Shaw
  - Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305
- Ekta Khurana
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065
  - His Royal Highness Prince Alwaleed Bin Talal Bin Abdulaziz Al-Saud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065
- Karsten Suhre
  - Department of Physiology and Biophysics, Weill Cornell Medicine-Qatar, Doha, Qatar
- Christopher E Mason
  - Center for Neurogenetics, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065
  - His Royal Highness Prince Alwaleed Bin Talal Bin Abdulaziz Al-Saud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065
- Olivier Elemento
  - Center for Neurogenetics, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065
  - His Royal Highness Prince Alwaleed Bin Talal Bin Abdulaziz Al-Saud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021
- Richard H Finnell
  - Department of Molecular and Cellular Biology, Center for Precision Environmental Health, Baylor College of Medicine, Houston, TX 77030
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030
  - Department of Medicine, Baylor College of Medicine, Houston, TX 77030
- M Elizabeth Ross
  - Center for Neurogenetics, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021
4
Wolujewicz P, Steele JW, Kaltschmidt JA, Finnell RH, Ross ME. Unraveling the complex genetics of neural tube defects: From biological models to human genomics and back. Genesis 2021; 59:e23459. [PMID: 34713546 DOI: 10.1002/dvg.23459] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 07/14/2021] [Revised: 09/08/2021] [Accepted: 09/17/2021] [Indexed: 12/11/2022]
Abstract
Neural tube defects (NTDs) are a classic example of preventable birth defects for which there is a proven-effective intervention, folic acid (FA); however, further methods of prevention remain unrealized. In the decades following implementation of FA nutritional fortification programs throughout at least 87 nations, it has become apparent that not all NTDs can be prevented by FA. In the United States, FA fortification only reduced NTD rates by 28-35% (Williams et al., 2015). As such, it is imperative that further work is performed to understand the risk factors associated with NTDs and their underlying mechanisms so that alternative prevention strategies can be developed. However, this is complicated by the sheer number of genes associated with neural tube development, the heterogeneity of observable phenotypes in human cases, the rareness of the disease, and the myriad of environmental factors associated with NTD risk. Given the complex genetic architecture underlying NTD pathology and the way in which that architecture interacts dynamically with environmental factors, further prevention initiatives will undoubtedly require precision medicine strategies that utilize the power of human genomics and modern tools for assessing genetic risk factors. Herein, we review recent advances in genomic strategies for discovering genetic variants associated with these defects, and new ways in which biological models, such as mice and cell culture-derived organoids, are leveraged to assess mechanistic functionality, the way these variants interact with other genetic or environmental factors, and their ultimate contribution to human NTD risk.
Affiliation(s)
- Paul Wolujewicz
  - Center for Neurogenetics, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, New York, New York, USA
- John W Steele
  - Center for Precision Environmental Health, Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, USA
- Julia A Kaltschmidt
  - Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA
- Richard H Finnell
  - Center for Precision Environmental Health, Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Department of Medicine, Baylor College of Medicine, Houston, Texas, USA
- Margaret Elizabeth Ross
  - Center for Neurogenetics, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, New York, New York, USA
5
Carress H, Lawson DJ, Elhaik E. Population genetic considerations for using biobanks as international resources in the pandemic era and beyond. BMC Genomics 2021; 22:351. [PMID: 34001009 PMCID: PMC8127217 DOI: 10.1186/s12864-021-07618-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/16/2020] [Accepted: 04/14/2021] [Indexed: 12/11/2022]
Abstract
The past years have seen the rise of genomic biobanks and mega-scale meta-analysis of genomic data, which promises to reveal the genetic underpinnings of health and disease. However, the over-representation of Europeans in genomic studies not only limits the global understanding of disease risk but also inhibits viable research into the genomic differences between carriers and patients. Whilst the community has agreed that more diverse samples are required, it is not enough to blindly increase diversity; the diversity must be quantified, compared and annotated to lead to insight. Genetic annotations from separate biobanks need to be comparable and computable and to operate without access to raw data due to privacy concerns. Comparability is key both for regular research and to allow international comparison in response to pandemics. Here, we evaluate the appropriateness of the most common genomic tools used to depict population structure in a standardized and comparable manner. The end goal is to reduce the effects of confounding and learn from genuine variation in genetic effects on phenotypes across populations, which will improve the value of biobanks (locally and internationally), increase the accuracy of association analyses and inform developmental efforts.
Affiliation(s)
- Hannah Carress
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK
- Daniel John Lawson
  - School of Mathematics and Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Eran Elhaik
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK
  - Department of Biology, Lund University, Lund, Sweden
6
Esposito U, Das R, Syed S, Pirooznia M, Elhaik E. Ancient Ancestry Informative Markers for Identifying Fine-Scale Ancient Population Structure in Eurasians. Genes (Basel) 2018; 9:E625. [PMID: 30545160 PMCID: PMC6316245 DOI: 10.3390/genes9120625] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/06/2018] [Revised: 12/05/2018] [Accepted: 12/10/2018] [Indexed: 12/23/2022]
Abstract
The rapid accumulation of ancient human genomes from various areas and time periods potentially enables the expansion of studies of biodiversity, biogeography, forensics, population history, and epidemiology into past populations. However, most ancient DNA (aDNA) data were generated through microarrays designed for modern-day populations, which are known to misrepresent the population structure. Past studies addressed these problems by using ancestry informative markers (AIMs). It is, therefore, unclear whether AIMs derived from contemporary human genomes can capture ancient population structures, and whether AIM-finding methods are applicable to aDNA, given that the high missingness rates in ancient (and oftentimes haploid) DNA can also distort the population structure. Here, we define ancient AIMs (aAIMs) and develop a framework to evaluate established and novel AIM-finding methods in identifying the most informative markers. We show that aAIMs identified by a novel principal component analysis (PCA)-based method outperform all of the competing methods in classifying ancient individuals into populations and identifying admixed individuals. In some cases, predictions made using the aAIMs were more accurate than those made with a complete marker set. We discuss the features of the ancient Eurasian population structure and strategies to identify aAIMs. This work informs the design of single nucleotide polymorphism (SNP) microarrays and the interpretation of aDNA results, which enables a population-wide testing of primordialist theories.
Affiliation(s)
- Umberto Esposito
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK.
- Ranajit Das
  - Manipal University, Manipal Centre for Natural Sciences (MCNS), Manipal, Karnataka, 576104, India.
- Syakir Syed
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK.
- Mehdi Pirooznia
  - Bioinformatics and Computational Biology, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA.
- Eran Elhaik
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK.