1
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (Asian honeybee, red fire ant, and chicken) and two simulated populations are used as proof of concept. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93% to 100%. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trade in wildlife products.
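Editorial illustration of the FST filtering step mentioned in the abstract above. The abstract does not state which FST estimator or cutoff the authors used, so this minimal sketch assumes a simple Nei-style per-site FST for two equally weighted populations and a hypothetical threshold; the function names `nei_fst` and `filter_sites` are illustrative and do not come from the paper.

```python
def nei_fst(p1, p2):
    """Nei-style FST at one biallelic site from two population allele frequencies.

    Assumes equal population weights; returns 0.0 for a site that is
    monomorphic across both populations (total heterozygosity is zero).
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def filter_sites(freqs, threshold):
    """Return indices of sites whose FST is at least `threshold` (hypothetical cutoff)."""
    return [i for i, (p1, p2) in enumerate(freqs) if nei_fst(p1, p2) >= threshold]


if __name__ == "__main__":
    # Three toy sites: fixed difference, no differentiation, strong differentiation.
    sites = [(0.0, 1.0), (0.5, 0.5), (0.1, 0.9)]
    print(filter_sites(sites, threshold=0.25))
```

Sites that pass such a filter would then form the reduced SNP panel given to a classifier; how the published method encodes the retained SNPs for its network is not described in the abstract.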
Affiliation(s)
- Bing Yang
  - Department of Entomology, China Agricultural University, Beijing 100193, China
- Xin Zhou
  - Department of Entomology, China Agricultural University, Beijing 100193, China
- Shanlin Liu
  - Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
2
Wang X, Heckel G. Genome-wide relaxation of selection and the evolution of the island syndrome in Orkney voles. Genome Res 2024; 34:851-862. [PMID: 38955466 DOI: 10.1101/gr.278487.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 05/14/2024] [Indexed: 07/04/2024]
Abstract
Island populations often experience different ecological and demographic conditions than their counterparts on the continent, resulting in divergent evolutionary forces affecting their genomes. Random genetic drift and selection both may leave their imprints on island populations, although the relative impact depends strongly on the specific conditions. Here we address their contributions to the island syndrome in a rodent with an unusually clear history of isolation. Common voles (Microtus arvalis) were introduced by humans on the Orkney archipelago north of Scotland >5000 years ago and rapidly evolved to exceptionally large size. Our analyses show that the genomes of Orkney voles were dominated by genetic drift, with extremely low diversity, variable Tajima's D, and very high divergence from continental conspecifics. Increased dN/dS ratios over a wide range of genes in Orkney voles indicated genome-wide relaxation of purifying selection. We found evidence of hard sweeps on key genes of the lipid metabolism pathway only in continental voles. The marked increase of body size in Orkney, a typical phenomenon of the island syndrome, may thus be associated with the relaxation of positive selection on genes related to this pathway. On the other hand, a hard sweep on immune genes of Orkney voles likely reflects the divergent ecological conditions and possibly the history of human introduction. The long-term isolated Orkney voles show that adaptive changes may still impact the evolutionary trajectories of such populations despite the pervasive consequences of genetic drift at the genome level.
Affiliation(s)
- Xuejing Wang
  - Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland
- Gerald Heckel
  - Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
3
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics 2024:iyae100. [PMID: 39013109 DOI: 10.1093/genetics/iyae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 06/05/2024] [Indexed: 07/18/2024]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8TA, UK
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, Newcastle NE1 7RU, UK
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Anthony W Wohns
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
4
Guo B, Takala-Harrison S, O’Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum. bioRxiv 2024:2024.05.04.592538. [PMID: 38746392 PMCID: PMC11092787 DOI: 10.1101/2024.05.04.592538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Genomic surveillance is crucial for identifying at-risk populations for targeted malaria control and elimination. Identity-by-descent (IBD) is increasingly being used in Plasmodium population genomics to estimate genetic relatedness, effective population size (Ne), population structure, and signals of positive selection. Despite its potential, a thorough evaluation of IBD segment detection tools for species with high recombination rates, such as P. falciparum, remains absent. Here, we perform comprehensive benchmarking of IBD callers - probabilistic (hmmIBD, isoRelate), identity-by-state-based (hap-IBD, phased IBD) and others (Refined IBD) - using population genetic simulations tailored for high recombination, and IBD quality metrics at both the IBD segment level and the IBD-based downstream inference level. Our results demonstrate that low marker density per genetic unit, related to high recombination relative to mutation, significantly compromises the accuracy of detected IBD segments. In genomes with high recombination rates resembling P. falciparum, most IBD callers exhibit high false negative rates for shorter IBD segments, which can be partially mitigated through optimization of IBD caller parameters, especially those related to marker density. Notably, IBD detected with optimized parameters allows for more accurate capture of selection signals and population structure; IBD-based Ne inference is very sensitive to IBD detection errors, with IBD called from hmmIBD uniquely providing less biased estimates of Ne in this context. Validation with empirical data from the MalariaGEN Pf7 database, representing different transmission settings, corroborates these findings. We conclude that context-specific evaluation and parameter optimization are essential for accurate IBD detection in high-recombining species and recommend hmmIBD for quality-sensitive analysis, such as estimation of Ne in these species.
Our optimization and high-level benchmarking methods not only improve IBD segment detection in high-recombining genomes but also enhance overall genomic analysis, paving the way for more accurate genomic surveillance and targeted intervention strategies for malaria.
Affiliation(s)
- Bing Guo
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Shannon Takala-Harrison
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Timothy D. O’Connor
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
5
Dehasque M, Morales HE, Díez-Del-Molino D, Pečnerová P, Chacón-Duque JC, Kanellidou F, Muller H, Plotnikov V, Protopopov A, Tikhonov A, Nikolskiy P, Danilov GK, Giannì M, van der Sluis L, Higham T, Heintzman PD, Oskolkov N, Gilbert MTP, Götherström A, van der Valk T, Vartanyan S, Dalén L. Temporal dynamics of woolly mammoth genome erosion prior to extinction. Cell 2024; 187:3531-3540.e13. [PMID: 38942016 DOI: 10.1016/j.cell.2024.05.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 02/08/2024] [Accepted: 05/17/2024] [Indexed: 06/30/2024]
Abstract
A number of species have recently recovered from near-extinction. Although these species have avoided the immediate extinction threat, their long-term viability remains precarious due to the potential genetic consequences of population declines, which are poorly understood on a timescale beyond a few generations. Woolly mammoths (Mammuthus primigenius) became isolated on Wrangel Island around 10,000 years ago and persisted for over 200 generations before becoming extinct around 4,000 years ago. To study the evolutionary processes leading up to the mammoths' extinction, we analyzed 21 Siberian woolly mammoth genomes. Our results show that the population recovered quickly from a severe bottleneck and remained demographically stable during the ensuing six millennia. We find that mildly deleterious mutations gradually accumulated, whereas highly deleterious mutations were purged, suggesting ongoing inbreeding depression that lasted for hundreds of generations. The time-lag between demographic and genetic recovery has wide-ranging implications for conservation management of recently bottlenecked populations.
Affiliation(s)
- Marianne Dehasque
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Hernán E Morales
  - Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- David Díez-Del-Molino
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Patrícia Pečnerová
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden; Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- J Camilo Chacón-Duque
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden; Department of Archaeology and Classical Studies, Stockholm University, Lilla Frescativägen 7, 11418 Stockholm, Sweden
- Foteini Kanellidou
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Héloïse Muller
  - Master de Biologie, Ecole Normale Superieure de Lyon, Universite Claude Bernard Lyon I, Universite de Lyon, 69007 Lyon, France
- Valerii Plotnikov
  - Academy of Sciences of Sakha Republic, Lenin Avenue 33, Yakutsk, Republic of Sakha (Yakutia), Russia
- Albert Protopopov
  - Academy of Sciences of Sakha Republic, Lenin Avenue 33, Yakutsk, Republic of Sakha (Yakutia), Russia
- Alexei Tikhonov
  - Zoological Institute of Russian Academy of Sciences, Saint-Petersburg, Russia
- Pavel Nikolskiy
  - Geological Institute of the Russian Academy of Sciences, Moscow, Russia
- Gleb K Danilov
  - Peter the Great Museum of Anthropology and Ethnography, Kunstkamera, Russian Academy of Sciences, 3 University Embankment, Box 199034, Saint-Petersburg, Russia
- Maddalena Giannì
  - Department of Evolutionary Anthropology, Faculty of Life Sciences, University of Vienna, Vienna, Austria; Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Laura van der Sluis
  - Department of Evolutionary Anthropology, Faculty of Life Sciences, University of Vienna, Vienna, Austria; Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Tom Higham
  - Department of Evolutionary Anthropology, Faculty of Life Sciences, University of Vienna, Vienna, Austria; Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Peter D Heintzman
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Geological Sciences, Stockholm University, 10691 Stockholm, Sweden
- Nikolay Oskolkov
  - Department of Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Lund University, Lund, Sweden
- M Thomas P Gilbert
  - Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark; University Museum, NTNU, Trondheim, Norway
- Anders Götherström
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Archaeology and Classical Studies, Stockholm University, Lilla Frescativägen 7, 11418 Stockholm, Sweden
- Tom van der Valk
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405 Stockholm, Sweden; SciLifeLab, Stockholm, Sweden
- Sergey Vartanyan
  - North-East Interdisciplinary Scientific Research Institute N.A.N.A. Shilo, Far East Branch, Russian Academy of Sciences, Magadan, Russia
- Love Dalén
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405 Stockholm, Sweden; Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
6
Cen S, Rasmussen DA. Exploring the Accuracy and Limits of Algorithms for Localizing Recombination Breakpoints. Mol Biol Evol 2024; 41:msae133. [PMID: 38917277 PMCID: PMC11229816 DOI: 10.1093/molbev/msae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 06/04/2024] [Accepted: 06/11/2024] [Indexed: 06/27/2024]
Abstract
Phylogenetic methods are widely used to reconstruct the evolutionary relationships among species and individuals. However, recombination can obscure ancestral relationships as individuals may inherit different regions of their genome from different ancestors. It is, therefore, often necessary to detect recombination events, locate recombination breakpoints, and select recombination-free alignments prior to reconstructing phylogenetic trees. While many earlier studies have examined the power of different methods to detect recombination, very few have examined the ability of these methods to accurately locate recombination breakpoints. In this study, we simulated genome sequences based on ancestral recombination graphs and explored the accuracy of three popular recombination detection methods: MaxChi, 3SEQ, and Genetic Algorithm Recombination Detection. The accuracy of inferred breakpoint locations was evaluated along with the key factors contributing to variation in accuracy across datasets. While many different genomic features contribute to the variation in performance across methods, the number of informative sites consistent with the pattern of inheritance between parent and recombinant child sequences always has the greatest contribution to accuracy. While partitioning sequence alignments based on identified recombination breakpoints can greatly decrease phylogenetic error, the quality of phylogenetic reconstructions depends very little on how breakpoints are chosen to partition the alignment. Our work sheds light on how different features of recombinant genomes affect the performance of recombination detection methods and suggests best practices for reconstructing phylogenies based on recombination-free alignments.
Affiliation(s)
- Shi Cen
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- David A Rasmussen
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, NC, USA
7
Xu P, Liang S, Hahn A, Zhao V, Lo WT‘J, Haller BC, Sobkowiak B, Chitwood MH, Colijn C, Cohen T, Rhee KY, Messer PW, Wells MT, Clark AG, Kim J. e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology. bioRxiv 2024:2024.06.29.601123. [PMID: 39005464 PMCID: PMC11244936 DOI: 10.1101/2024.06.29.601123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Infectious disease dynamics are driven by the complex interplay of epidemiological, ecological, and evolutionary processes. Accurately modeling these interactions is crucial for understanding pathogen spread and informing public health strategies. However, existing simulators often fail to capture the dynamic interplay between these processes, resulting in oversimplified models that do not fully reflect real-world complexities in which the pathogen's genetic evolution dynamically influences disease transmission. We introduce the epidemiological-ecological-evolutionary simulator (e3SIM), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors. Using an agent-based, discrete-generation, forward-in-time approach, e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. This integration allows for realistic simulations of disease spread and pathogen evolution. Key features include a modular and scalable design, flexibility in modeling various epidemiological and population-genetic complexities, incorporation of time-varying environmental factors, and a user-friendly graphical interface. We demonstrate e3SIM's capabilities through simulations of realistic outbreak scenarios with SARS-CoV-2 and Mycobacterium tuberculosis, illustrating its flexibility for studying the genomic epidemiology of diverse pathogen types.
Affiliation(s)
- Peiyu Xu
  - Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
- Shenni Liang
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Andrew Hahn
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Vivian Zhao
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Wai Tung ‘Jack’ Lo
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Benjamin C. Haller
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Benjamin Sobkowiak
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Melanie H. Chitwood
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Caroline Colijn
  - Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
- Ted Cohen
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Kyu Y. Rhee
  - Department of Medicine, Weill Cornell Medicine, New York, NY, USA
- Philipp W. Messer
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Martin T. Wells
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
- Andrew G. Clark
  - Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Jaehee Kim
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
8
Özkan M, Gürün K, Yüncü E, Vural KB, Atağ G, Akbaba A, Fidan FR, Sağlıcan E, Altınışık EN, Koptekin D, Pawłowska K, Hodder I, Adcock SE, Arbuckle BS, Steadman SR, McMahon G, Erdal YS, Bilgin CC, Togan İ, Geigl EM, Götherström A, Grange T, Özer F, Somel M. The first complete genome of the extinct European wild ass (Equus hemionus hydruntinus). Mol Ecol 2024; 33:e17440. [PMID: 38946459 DOI: 10.1111/mec.17440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 05/17/2024] [Accepted: 06/14/2024] [Indexed: 07/02/2024]
Abstract
We present palaeogenomes of three morphologically unidentified Anatolian equids dating to the first millennium BCE, sequenced to a coverage of 0.6-6.4×. Mitochondrial DNA haplotypes of the Anatolian individuals clustered with those of Equus hydruntinus (or Equus hemionus hydruntinus), the extinct European wild ass, secular name 'hydruntine'. Further, the Anatolian wild ass whole genome profiles fell outside the genomic diversity of other extant and past Asiatic wild ass (E. hemionus) lineages. These observations suggest that the three Anatolian wild asses represent hydruntines, making them the latest recorded survivors of this lineage, about a millennium later than the latest observations in the zooarchaeological record. Our mitogenomic and genomic analyses indicate that E. h. hydruntinus was a clade belonging to ancient and present-day E. hemionus lineages that radiated possibly between 0.6 and 0.8 Mya. We also find evidence consistent with recent gene flow between hydruntines and Middle Eastern wild asses. Analyses of genome-wide heterozygosity and runs of homozygosity suggest that the Anatolian wild ass population may have lost genetic diversity by the mid-first millennium BCE, a possible sign of its eventual demise.
Affiliation(s)
- Mustafa Özkan
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kanat Gürün
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Eren Yüncü
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kıvılcım Başak Vural
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Gözde Atağ
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Ali Akbaba
  - Department of Anthropology, Ankara University, Ankara, Turkey
  - Alparslan University, Muş, Turkey
- Fatma Rabia Fidan
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
  - Cancer Dynamics Laboratory, The Francis Crick Institute, London, UK
- Ekin Sağlıcan
  - Department of Health Informatics, Middle East Technical University, Ankara, Turkey
- Ezgi N Altınışık
  - Department of Anthropology, Human_G Laboratory, Hacettepe University, Ankara, Turkey
- Dilek Koptekin
  - Department of Health Informatics, Middle East Technical University, Ankara, Turkey
- Kamilla Pawłowska
  - Department of Palaeoenvironmental Research, Adam Mickiewicz University, Poznań, Poland
- Ian Hodder
  - Department of Anthropology, Stanford University, Stanford, California, USA
- Sarah E Adcock
  - Institute for the Study of the Ancient World, New York University, New York, New York, USA
- Benjamin S Arbuckle
  - Department of Anthropology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Sharon R Steadman
  - Department of Sociology/Anthropology, SUNY Cortland, Cortland, New York, USA
- Gregory McMahon
  - Classics, Humanities and Italian Studies Department, University of New Hampshire, Durham, New Hampshire, USA
- Yılmaz Selim Erdal
  - Department of Anthropology, Human_G Laboratory, Hacettepe University, Ankara, Turkey
- C Can Bilgin
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- İnci Togan
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Eva-Maria Geigl
  - Institut Jacques Monod, CNRS, Université de Paris, Paris, France
- Anders Götherström
  - Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
- Thierry Grange
  - Institut Jacques Monod, CNRS, Université de Paris, Paris, France
- Füsun Özer
  - Department of Health Informatics, Middle East Technical University, Ankara, Turkey
- Mehmet Somel
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
9
Clark MI, Fitzpatrick SW, Bradburd GS. Pitfalls and windfalls of detecting demographic declines using population genetics in long-lived species. Evol Appl 2024; 17:e13754. [PMID: 39006005 PMCID: PMC11246600 DOI: 10.1111/eva.13754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Revised: 06/13/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024]
Abstract
Detecting recent demographic changes is a crucial component of species conservation and management, as many natural populations face declines due to anthropogenic habitat alteration and climate change. Genetic methods allow researchers to detect changes in effective population size (Ne) from sampling at a single timepoint. However, in species with long lifespans, there is a lag between the start of a decline in a population and the resulting decrease in genetic diversity. This lag slows the rate at which diversity is lost, and therefore makes it difficult to detect recent declines using genetic data. However, the genomes of old individuals can provide a window into the past, and can be compared to those of younger individuals, a contrast that may help reveal recent demographic declines. To test whether comparing the genomes of young and old individuals can help infer recent demographic bottlenecks, we use forward-time, individual-based simulations with varying mean individual lifespans and extents of generational overlap. We find that age information can be used to aid in the detection of demographic declines when the decline has been severe. When average lifespan is long, comparing young and old individuals from a single timepoint has greater power to detect a recent (within the last 50 years) bottleneck event than comparing individuals sampled at different points in time. Our results demonstrate how longevity and generational overlap can be both a hindrance and a boon to detecting recent demographic declines from population genomic data.
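Editorial illustration of why long generation times delay the genetic signal of a decline, as the abstract above describes. This is a rough sketch using the textbook expectation for heterozygosity under idealized Wright-Fisher drift, H_t = H_0 (1 - 1/(2Ne))^t with t in generations. It assumes discrete, non-overlapping generations, which the study's individual-based simulations explicitly do not; the function name and the parameter values are hypothetical.

```python
def expected_heterozygosity(h0, ne, years, generation_time):
    """Expected heterozygosity after `years` of drift at constant effective size `ne`.

    Idealized Wright-Fisher expectation; elapsed calendar time is converted
    to generations by dividing by the mean generation time in years.
    """
    generations = years / generation_time
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** generations


if __name__ == "__main__":
    # Same 50-year bottleneck (Ne = 50), short-lived vs long-lived species.
    for gen_time in (1, 25):
        h = expected_heterozygosity(h0=0.3, ne=50, years=50, generation_time=gen_time)
        print(f"generation time {gen_time:>2} y: expected H = {h:.4f}")
```

Under these assumptions, a species with a 25-year generation time passes through only two generations in 50 years and retains nearly all of its starting heterozygosity, which is the lag the abstract refers to.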
Affiliation(s)
- Meaghan I. Clark
  - Department of Integrative Biology, Michigan State University, East Lansing, Michigan, USA
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, Michigan, USA
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, USA
- Sarah W. Fitzpatrick
  - Department of Integrative Biology, Michigan State University, East Lansing, Michigan, USA
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, Michigan, USA
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, USA
- Gideon S. Bradburd
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, USA
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
10
Aktürk Ş, Mapelli I, Güler MN, Gürün K, Katırcıoğlu B, Vural KB, Sağlıcan E, Çetin M, Yaka R, Sürer E, Atağ G, Çokoğlu SS, Sevkar A, Altınışık NE, Koptekin D, Somel M. Benchmarking kinship estimation tools for ancient genomes using pedigree simulations. Mol Ecol Resour 2024; 24:e13960. [PMID: 38676702 DOI: 10.1111/1755-0998.13960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2023] [Revised: 03/19/2024] [Accepted: 03/28/2024] [Indexed: 04/29/2024]
Abstract
There is growing interest in uncovering genetic kinship patterns in past societies using low-coverage palaeogenomes. Here, we benchmark four tools for kinship estimation with such data: lcMLkin, NgsRelate, KIN, and READ, which differ in their input, IBD estimation methods, and statistical approaches. We used pedigree and ancient genome sequence simulations to evaluate these tools when only a limited number (1 to 50 K, with minor allele frequency ≥0.01) of shared SNPs are available. The performance of all four tools was comparable using ≥20 K SNPs. We found that first-degree related pairs can be accurately classified even with 1 K SNPs, with 85% F1 scores using READ and 96% using NgsRelate or lcMLkin. Distinguishing third-degree relatives from unrelated pairs or second-degree relatives was also possible with high accuracy (F1 > 90%) with 5 K SNPs using NgsRelate and lcMLkin, while READ and KIN showed lower success (69 and 79% respectively). Meanwhile, noise in population allele frequencies and inbreeding (first-cousin mating) led to deviations in kinship coefficients, with different sensitivities across tools. We conclude that using multiple tools in parallel might be an effective approach to achieve robust estimates on ultra-low-coverage genomes.
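Editorial illustration of the F1 score used as the accuracy measure in the abstract above. The sketch uses the standard definition (the harmonic mean of precision and recall) for a single relatedness class; the counts in the usage example are invented for illustration and are not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 score for one class from confusion counts.

    Equivalent to 2 * precision * recall / (precision + recall);
    returns 0.0 when there are no true positives.
    """
    if true_pos == 0:
        return 0.0
    return 2.0 * true_pos / (2.0 * true_pos + false_pos + false_neg)


if __name__ == "__main__":
    # Hypothetical counts: 8 first-degree pairs called correctly,
    # 2 unrelated pairs called first-degree, 2 first-degree pairs missed.
    print(f1_score(true_pos=8, false_pos=2, false_neg=2))
```

When comparing tools across several relatedness degrees, one such score would typically be computed per class; how the authors aggregate across classes is not stated in the abstract.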
Affiliation(s)
- Şevval Aktürk
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Igor Mapelli
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Merve N Güler
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kanat Gürün
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Büşra Katırcıoğlu
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kıvılcım Başak Vural
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Ekin Sağlıcan
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Mehmet Çetin
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Reyhan Yaka
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Centre for Palaeogenetics, Stockholm, Sweden
- Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
- Elif Sürer
- Department of Modeling and Simulation, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Gözde Atağ
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Sevim Seda Çokoğlu
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Arda Sevkar
- Department of Anthropology, Hacettepe University, Ankara, Turkey
- N Ezgi Altınışık
- Department of Anthropology, Hacettepe University, Ankara, Turkey
- Dilek Koptekin
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Mehmet Somel
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
|
11
|
Naseri A, Zhi D, Zhang S. Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank. eLife 2024; 13:e81698. [PMID: 38905121 PMCID: PMC11249732 DOI: 10.7554/elife.81698] [Received: 07/08/2022] [Accepted: 06/20/2024] [Indexed: 06/23/2024] [Open Access]
Abstract
Runs-of-homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for the efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), to find large ROH diplotype clusters, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 single nucleotide polymorphisms (SNPs) and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended human leukocyte antigen (HLA) region and autoimmune disorders. We found an association between a diplotype covering the homeostatic iron regulator (HFE) gene and hemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients (p-value = 1.82 × 10⁻¹¹). In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population-genetic investigation of the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at a population scale.
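As a rough illustration of what a run of homozygosity is, the Python sketch below scans one individual's genotype vector for contiguous homozygous stretches. It is not the ROH-DICE algorithm, which clusters diplotypes shared across many individuals. The 0/1/2 genotype coding, the function name `find_roh`, and the `min_snps` threshold are assumptions made for this example.

```python
from typing import List, Tuple


def find_roh(genotypes: List[int], min_snps: int = 100) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of homozygous runs.

    Assumed coding: alternate-allele counts, so 0 and 2 are homozygous
    and 1 is heterozygous. Only runs of at least min_snps sites are kept.
    """
    runs = []
    start = None  # index where the current homozygous run began
    for i, g in enumerate(genotypes):
        if g in (0, 2):
            if start is None:
                start = i
        else:
            # A heterozygous site ends the current run, if any.
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the vector.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs
```

A production ROH caller would also tolerate genotyping errors and use physical or genetic distance rather than a plain SNP count; this sketch ignores both.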
Affiliation(s)
- Ardalan Naseri
- Department of Computer Science, University of Central Florida, Orlando, United States
- Degui Zhi
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, United States
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, United States
|
12
|
Patel RA, Weiß CL, Zhu H, Mostafavi H, Simons YB, Spence JP, Pritchard JK. Conditional frequency spectra as a tool for studying selection on complex traits in biobanks. bioRxiv 2024:2024.06.15.599126. [PMID: 38948697 PMCID: PMC11212903 DOI: 10.1101/2024.06.15.599126] [Indexed: 07/02/2024]
Abstract
Natural selection on complex traits is difficult to study in part due to the ascertainment inherent to genome-wide association studies (GWAS). The power to detect a trait-associated variant in GWAS is a function of frequency and effect size - but for traits under selection, the effect size of a variant determines the strength of selection against it, constraining its frequency. To account for GWAS ascertainment, we propose studying the joint distribution of allele frequencies across populations, conditional on the frequencies in the GWAS cohort. Before considering these conditional frequency spectra, we first characterized the impact of selection and non-equilibrium demography on allele frequency dynamics forwards and backwards in time. We then used these results to understand conditional frequency spectra under realistic human demography. Finally, we investigated empirical conditional frequency spectra for GWAS variants associated with 106 complex traits, finding compelling evidence for either stabilizing or purifying selection. Our results provide insight into polygenic score portability and other properties of variants ascertained with GWAS, highlighting the utility of conditional frequency spectra.
Affiliation(s)
- Roshni A. Patel
- Department of Genetics, Stanford University School of Medicine, Stanford, CA
- Clemens L. Weiß
- Stanford Cancer Institute Core, Stanford University School of Medicine, Stanford, CA
- Huisheng Zhu
- Department of Biology, Stanford University, Stanford, CA
- Hakhamanesh Mostafavi
- Center for Human Genetics and Genomics, New York University School of Medicine, New York, NY
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY
- Jeffrey P. Spence
- Department of Genetics, Stanford University School of Medicine, Stanford, CA
- Jonathan K. Pritchard
- Department of Genetics, Stanford University School of Medicine, Stanford, CA
- Department of Biology, Stanford University, Stanford, CA
|
13
|
Anderson NW, Kirk L, Schraiber JG, Ragsdale AP. A Path Integral Approach for Allele Frequency Dynamics Under Polygenic Selection. bioRxiv 2024:2024.06.14.599114. [PMID: 38915613 PMCID: PMC11195211 DOI: 10.1101/2024.06.14.599114] [Indexed: 06/26/2024]
Abstract
Many phenotypic traits have a polygenic genetic basis, making it challenging to learn their genetic architectures and predict individual phenotypes. One promising avenue to resolve the genetic basis of complex traits is through evolve-and-resequence experiments, in which laboratory populations are exposed to some selective pressure and trait-contributing loci are identified by extreme frequency changes over the course of the experiment. However, small laboratory populations will experience substantial random genetic drift, and it is difficult to determine whether selection played a role in a given allele frequency change. Predicting how much allele frequencies change under drift and selection had remained an open problem well into the 21st century, even for alleles contributing to simple, monogenic traits. Recently, there have been efforts to apply the path integral, a method borrowed from physics, to solve this problem. So far, this approach has been limited to genic selection, and is therefore inadequate to capture the complexity of quantitative, highly polygenic traits that are commonly studied. Here we extend one of these path integral methods, the perturbation approximation, to selection scenarios that are of interest to quantitative genetics. In particular, we derive analytic expressions for the transition probability (i.e., the probability that an allele will change in frequency from x to y in time t) of an allele contributing to a trait subject to stabilizing selection, as well as that of an allele contributing to a trait rapidly adapting to a new phenotypic optimum. We use these expressions to characterize the use of allele frequency change to test for selection, as well as explore optimal design choices for evolve-and-resequence experiments to uncover the genetic architecture of polygenic traits under selection.
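For small populations, the transition probability mentioned in this abstract can also be computed by brute force. The sketch below is not the path integral approximation. It is an exact Wright-Fisher calculation under pure drift, assuming a haploid population of `n` individuals and no selection or mutation; the function names are illustrative.

```python
from math import comb
from typing import List


def wf_one_generation(n: int) -> List[List[float]]:
    """One-generation transition matrix: entry [i][j] is the probability of
    going from i to j allele copies under binomial (Wright-Fisher) sampling."""
    matrix = []
    for i in range(n + 1):
        p = i / n  # current allele frequency
        matrix.append(
            [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
        )
    return matrix


def wf_transition(n: int, i: int, j: int, t: int) -> float:
    """Probability that an allele at i copies is at j copies after t generations."""
    one = wf_one_generation(n)
    dist = [0.0] * (n + 1)
    dist[i] = 1.0  # start with all probability mass on i copies
    for _ in range(t):
        # Propagate the distribution forward by one generation.
        dist = [sum(dist[k] * one[k][m] for k in range(n + 1)) for m in range(n + 1)]
    return dist[j]
```

The cost grows as the square of `n` per generation, which is one reason analytic approximations are attractive for larger populations.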
Affiliation(s)
- Nathan W. Anderson
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Lloyd Kirk
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Joshua G. Schraiber
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
- Aaron P. Ragsdale
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
|
14
|
Czech E, Millar TR, White T, Jeffery B, Miles A, Tallman S, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2024:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Indexed: 06/26/2024]
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
Affiliation(s)
- Eric Czech
- Related Sciences and Lincoln, Lincoln, New Zealand
- Timothy R. Millar
- The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
- Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
- Tom White Consulting Ltd., Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Ben Jeffery
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Alistair Miles
- Wellcome Sanger Institute, McGill University, Montreal, QC, Canada
- Sam Tallman
- Genomics England, McGill University, Montreal, QC, Canada
- Shadi Zabad
- School of Computer Science, McGill University, Montreal, QC, Canada
- Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
|
15
|
Temple SD, Thompson EA. Identity-by-descent segments in large samples. bioRxiv 2024:2024.06.05.597656. [PMID: 38895476 PMCID: PMC11185678 DOI: 10.1101/2024.06.05.597656] [Indexed: 06/21/2024]
Abstract
If two haplotypes share the same alleles for an extended gene tract, these haplotypes are likely to derive identical-by-descent from a recent common ancestor. Identity-by-descent segment lengths are correlated via unobserved tree and recombination processes, which commonly presents challenges to the derivation of theoretical results in population genetics. Under interpretable regularity conditions, we show that the proportion of detectable identity-by-descent segments at a locus is normally distributed for large sample size and large scaled population size. We use efficient and exact simulations to study the distributional behavior of the detectable identity-by-descent rate in finite samples. One consequence of non-normality in finite samples is that genome-wide scans based on identity-by-descent rates may be subject to anti-conservative Type 1 error control. Highlights: We show the asymptotic normality of the identity-by-descent rate, a mean of correlated binary random variables that arises in population genetics studies. We describe an efficient algorithm capable of simulating long identity-by-descent segments around a locus in large sample sizes. In enormous simulation studies, we use this algorithm to characterize the distributional properties of the identity-by-descent rate. In finite samples, we reject the null hypothesis of normality more often than the nominal significance level, indicating that genome-wide scans based on identity-by-descent rates may be anti-conservative.
|
16
|
Belman S, Pesonen H, Croucher NJ, Bentley SD, Corander J. Estimating between-country migration in pneumococcal populations. G3 (Bethesda) 2024; 14:jkae058. [PMID: 38507601 PMCID: PMC11152062 DOI: 10.1093/g3journal/jkae058] [Received: 11/22/2023] [Revised: 02/29/2024] [Accepted: 03/11/2024] [Indexed: 03/22/2024]
Abstract
Streptococcus pneumoniae (the pneumococcus) is a globally distributed, human obligate opportunistic bacterial pathogen which, although often carried commensally, is also a significant cause of invasive disease. Apart from multi-drug resistant and virulent clones, the rate and direction of pneumococcal dissemination between different countries remains largely unknown. The ability of the pneumococcus to gain a foothold in a country depends on existing population configuration, the extent of vaccine implementation, as well as human mobility since it is a human obligate bacterium. To shed light on its international movement, we used extensive genome data from the Global Pneumococcal Sequencing project and estimated migration parameters between multiple countries in Africa. Data on allele frequencies of polymorphisms at housekeeping-like loci for multiple different lineages circulating in the populations of South Africa, Malawi, Kenya, and The Gambia were used to calculate the fixation index (Fst) between countries. We then further used these summaries to fit migration coalescent models with the likelihood-free inference algorithms available in the ELFI software package. Synthetic data were additionally used to validate the inference approach. Our results demonstrate country-pair specific migration patterns and heterogeneity in the extent of migration between different lineages. Our approach demonstrates that coalescent models can be effectively used for inferring migration rates for bacterial species and lineages provided sufficiently granular population genomics surveillance data. Further, it can demonstrate the connectivity of respiratory disease agents between countries to inform intervention policy in the longer term.
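The abstract does not say which Fst estimator was used. As a minimal sketch of the underlying idea, the Python function below applies one common textbook definition, Fst = (H_T - H_S) / H_T, to a single biallelic locus in two equally weighted populations. The function name and the equal weighting are assumptions for illustration only.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """Fst = (H_T - H_S) / H_T for one biallelic locus, where p1 and p2 are
    the frequencies of the same allele in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        # Locus is monomorphic overall, so Fst is undefined; report 0 here.
        return 0.0
    return (h_total - h_within) / h_total
```

Identical frequencies give 0 and fixed differences give 1; real analyses typically average over many loci and correct for sample size.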
Affiliation(s)
- Sophie Belman
- Parasites and Microbes, Wellcome Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK
- Henri Pesonen
- Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Oslo, 0372, Norway
- Nicholas J Croucher
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, White City Campus, Imperial College London, London W12 0BZ, UK
- Stephen D Bentley
- Parasites and Microbes, Wellcome Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK
- Jukka Corander
- Department of Biostatistics, University of Oslo, Oslo, 0372, Norway
- Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki, Espoo, Helsinki, 02150, Finland
|
17
|
Dutheil JY. On the estimation of genome-average recombination rates. Genetics 2024; 227:iyae051. [PMID: 38565705 PMCID: PMC11232287 DOI: 10.1093/genetics/iyae051] [Received: 01/22/2024] [Revised: 03/13/2024] [Accepted: 03/20/2024] [Indexed: 04/04/2024] [Open Access]
Abstract
The rate at which recombination events occur in a population is an indicator of its effective population size and the organism's reproduction mode. It determines the extent of linkage disequilibrium along the genome and, thereby, the efficacy of both purifying and positive selection. The population recombination rate can be inferred using models of genome evolution in populations. Classic methods based on the patterns of linkage disequilibrium provide the most accurate estimates, providing large sample sizes are used and the demography of the population is properly accounted for. Here, the capacity of approaches based on the sequentially Markov coalescent (SMC) to infer the genome-average recombination rate from as little as a single diploid genome is examined. SMC approaches provide highly accurate estimates even in the presence of changing population sizes, providing that (1) within genome heterogeneity is accounted for and (2) classic maximum-likelihood optimization algorithms are employed to fit the model. SMC-based estimates proved sensitive to gene conversion, leading to an overestimation of the recombination rate if conversion events are frequent. Conversely, methods based on the correlation of heterozygosity succeed in disentangling the rate of crossing over from that of gene conversion events, but only when the population size is constant and the recombination landscape homogeneous. These results call for a convergence of these two methods to obtain accurate and comparable estimates of recombination rates between populations.
Affiliation(s)
- Julien Y Dutheil
- Max Planck Institute for Evolutionary Biology, August-Thienemann-Str. 2, Plön 24306, Germany
|
18
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. Bioinformatics 2024; 40:btae334. [PMID: 38796683 DOI: 10.1093/bioinformatics/btae334] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024]
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
Affiliation(s)
- Daiki Tagami
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
- Gertjan Bisschop
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
- Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
|
19
|
Hobolth A, Rivas-González I, Bladt M, Futschik A. Phase-type distributions in mathematical population genetics: An emerging framework. Theor Popul Biol 2024; 157:14-32. [PMID: 38460602 DOI: 10.1016/j.tpb.2024.03.001] [Received: 06/04/2023] [Revised: 02/29/2024] [Accepted: 03/04/2024] [Indexed: 03/11/2024]
Abstract
A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. 
Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
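To make the "matrix manipulations" concrete, here is a small worked example that is not taken from the review itself. It assumes the standard Kingman coalescent for a sample of n = 3 with time in coalescent units, and uses the general phase-type mean formula E[T] = α(−S)⁻¹1, where α is the initial distribution over transient states and S is the sub-intensity matrix.

```latex
% Transient states: 3 lineages, 2 lineages. Absorbing state: 1 lineage (the MRCA).
% Coalescence rates: binom(3,2) = 3 from three lineages, binom(2,2) = 1 from two.
\[
\boldsymbol{\alpha} = (1,\ 0), \qquad
\mathbf{S} = \begin{pmatrix} -3 & 3 \\ 0 & -1 \end{pmatrix}, \qquad
(-\mathbf{S})^{-1} = \begin{pmatrix} \tfrac{1}{3} & 1 \\ 0 & 1 \end{pmatrix}
\]
\[
\mathbb{E}[T_{\mathrm{MRCA}}]
  = \boldsymbol{\alpha}\,(-\mathbf{S})^{-1}\,\mathbf{1}
  = \tfrac{1}{3} + 1 = \tfrac{4}{3}
\]
```

The result agrees with the classical expression 2(1 − 1/n) evaluated at n = 3.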
Affiliation(s)
- Asger Hobolth
- Department of Mathematics, Aarhus University, Denmark.
- Mogens Bladt
- Department of Mathematical Sciences, University of Copenhagen, Denmark.
- Andreas Futschik
- Institute of Applied Statistics, Johannes Kepler University, Austria.
|
20
|
Ouerghi F, Krane DE, Edge MD. On forensic likelihood ratios from low-coverage sequencing. bioRxiv 2024:2024.05.24.595821. [PMID: 38854110 PMCID: PMC11160658 DOI: 10.1101/2024.05.24.595821] [Indexed: 06/11/2024]
Abstract
With advances in sequencing technology, forensic workers can access genetic information from increasingly challenging samples. A recently published computational approach, IBDGem, analyzes sequencing reads, including from low-coverage samples, in order to arrive at likelihood ratios for tests of identity. Here, we show that likelihood ratios produced by IBDGem test a null hypothesis different from the traditional one used in a forensic genetics context. In particular, rather than testing the hypothesis that the sample comes from a person unrelated to the person of interest, IBDGem tests the hypothesis that the sample comes from an individual who is included in the reference sample used to run the method. This null hypothesis is not generally of forensic interest, because the defense hypothesis is not that the evidence comes from an individual included in a reference panel. Further, it does not take into account genetic variation outside the reference panel, and as a result, the computed likelihood ratios can be much larger than likelihood ratios computed for the standard forensic null hypothesis, often by many orders of magnitude, thus potentially creating an impression of stronger evidence for identity than is warranted. We lay out this result and illustrate it with examples, giving suggestions for directions that might lead to likelihood ratios that have the traditional interpretation.
|
21
|
Shpak M, Lawrence KN, Pool JE. The Precision and Power of Population Branch Statistics in Identifying the Genomic Signatures of Local Adaptation. bioRxiv 2024:2024.05.14.594139. [PMID: 38798330 PMCID: PMC11118325 DOI: 10.1101/2024.05.14.594139] [Indexed: 05/29/2024]
Abstract
Population branch statistics, which estimate the branch lengths of focal populations with respect to two outgroups, have been used as an alternative to FST-based genome-wide scans for identifying loci associated with local selective sweeps. In addition to the original population branch statistic (PBS), there are subsequently proposed branch rescalings: normalized population branch statistic (PBSn1), which adjusts focal branch length with respect to outgroup branch lengths at the same locus, and population branch excess (PBE), which also incorporates median branch lengths at other loci. PBSn1 and PBE have been proposed to be less sensitive to allele frequency divergence generated by background selection or geographically ubiquitous positive selection rather than local selective sweeps. However, the accuracy and statistical power of branch statistics have not been systematically assessed. To do so, we simulate genomes in representative large and small populations with varying proportions of sites evolving under genetic drift or background selection (approximated using variable Ne), local selective sweeps, and geographically parallel selective sweeps. We then assess the probability that local selective sweep loci are correctly identified as outliers by FST and by each of the branch statistics. We find that branch statistics consistently outperform FST at identifying local sweeps. When background selection and/or parallel sweeps are introduced, PBSn1 and especially PBE correctly identify local sweeps among their top outliers at a higher frequency than PBS. These results validate the greater specificity of rescaled branch statistics such as PBE to detect population-specific positive selection, supporting their use in genomic studies focused on local adaptation.
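The abstract does not spell out the formula. As a sketch, the original PBS is usually defined in the literature from three pairwise FST values, each transformed as T = -ln(1 - FST), with the focal branch length taken as (T_focal,out1 + T_focal,out2 - T_out1,out2) / 2. The Python below assumes that definition; the function names are illustrative, and the rescaled variants PBSn1 and PBE are not shown.

```python
from math import log


def branch_length(fst: float) -> float:
    """Transform a pairwise FST (must be < 1) into an additive
    divergence measure, T = -ln(1 - FST)."""
    return -log(1.0 - fst)


def pbs(fst_focal_out1: float, fst_focal_out2: float, fst_out1_out2: float) -> float:
    """Population branch statistic for the focal population at one locus."""
    t_f1 = branch_length(fst_focal_out1)
    t_f2 = branch_length(fst_focal_out2)
    t_12 = branch_length(fst_out1_out2)
    # Length of the branch leading to the focal population in the
    # unrooted three-population tree.
    return (t_f1 + t_f2 - t_12) / 2.0
```

A locus where the focal population is strongly differentiated from both outgroups, while the outgroups resemble each other, gets a large PBS.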
Affiliation(s)
- Max Shpak
- Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
- Kadee N. Lawrence
- Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
- John E. Pool
- Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
|
22
|
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally Efficient Demographic History Inference from Allele Frequencies with Supervised Machine Learning. Mol Biol Evol 2024; 41:msae077. [PMID: 38636507 PMCID: PMC11082913 DOI: 10.1093/molbev/msae077] [Received: 05/24/2023] [Revised: 04/08/2024] [Accepted: 04/12/2024] [Indexed: 04/20/2024] [Open Access]
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite-likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we present donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future genomic data summarized by an AFS. We demonstrate that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
Affiliation(s)
- Linh N Tran
  - Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ 85721, USA
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Connie K Sun
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Travis J Struck
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Mathews Sajan
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Ryan N Gutenkunst
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
|
23
|
Eldon B, Stephan W. Sweepstakes reproduction facilitates rapid adaptation in highly fecund populations. Mol Ecol 2024; 33:e16903. [PMID: 36896794 DOI: 10.1111/mec.16903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 02/21/2023] [Accepted: 02/23/2023] [Indexed: 03/11/2023]
Abstract
Adaptation enables natural populations to survive in a changing environment. Understanding the mechanics of adaptation is therefore crucial for learning about the evolution and ecology of natural populations. We focus on the impact of random sweepstakes on selection in highly fecund haploid and diploid populations partitioned into two genetic types, with one type conferring selective advantage. For the diploid populations, we incorporate various dominance mechanisms. We assume that the populations may experience recurrent bottlenecks. In random sweepstakes, the distribution of individual recruitment success is highly skewed, resulting in a huge variance in the number of offspring contributed by the individuals present in any given generation. Using computer simulations, we investigate the joint effects of random sweepstakes, recurrent bottlenecks and dominance mechanisms on selection. In our framework, bottlenecks allow random sweepstakes to have an effect on the time to fixation, and in diploid populations, the effect of random sweepstakes depends on the dominance mechanism. We describe selective sweepstakes that are approximated by recurrent sweeps of strongly beneficial allelic types arising by mutation. We demonstrate that both types of sweepstakes reproduction may facilitate rapid adaptation (as defined based on the average time to fixation of a type conferring selective advantage conditioned on fixation of the type). However, whether random sweepstakes cause rapid adaptation depends also on their interactions with bottlenecks and dominance mechanisms. Finally, we review a case study in which a model of recurrent sweeps is shown to essentially explain population genomic data from Atlantic cod.
Affiliation(s)
- Bjarki Eldon
  - Institute of Evolution and Biodiversity Science, Natural History Museum Berlin, Berlin, Germany
|
24
|
DeHaas D, Pan Z, Wei X. Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data. bioRxiv 2024:2024.04.23.590800. [PMID: 38712040 PMCID: PMC11071416 DOI: 10.1101/2024.04.23.590800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), too large to fit into hard drives in uncompressed form. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a graph format compresses biobank-scale data to the point where it can fit in a typical server's RAM (5-26 GB per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160 GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 2000 times smaller than the size of VCF. Moreover, the size of GRG increases sublinearly with the number of samples stored, making it a sustainable solution to the increasing number of samples in large datasets. We show that summaries of genetic variants can be computed on GRG via graph traversal that runs 230 times faster than on VCF. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
Affiliation(s)
- Drew DeHaas
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Ziqing Pan
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, Ithaca, NY
|
25
|
Sommer-Trembo C, Santos ME, Clark B, Werner M, Fages A, Matschiner M, Hornung S, Ronco F, Oliver C, Garcia C, Tschopp P, Malinsky M, Salzburger W. The genetics of niche-specific behavioral tendencies in an adaptive radiation of cichlid fishes. Science 2024; 384:470-475. [PMID: 38662824 DOI: 10.1126/science.adj9228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 03/12/2024] [Indexed: 05/03/2024]
Abstract
Behavior is critical for animal survival and reproduction, and possibly for diversification and evolutionary radiation. However, the genetics behind adaptive variation in behavior are poorly understood. In this work, we examined a fundamental and widespread behavioral trait, exploratory behavior, in one of the largest adaptive radiations on Earth, the cichlid fishes of Lake Tanganyika. By integrating quantitative behavioral data from 57 cichlid species (702 wild-caught individuals) with high-resolution ecomorphological and genomic information, we show that exploratory behavior is linked to macrohabitat niche adaptations in Tanganyikan cichlids. Furthermore, we uncovered a correlation between the genotypes at a single-nucleotide polymorphism upstream of the AMPA glutamate-receptor regulatory gene cacng5b and variation in exploratory tendency. We validated this association using behavioral predictions with a neural network approach and CRISPR-Cas9 genome editing.
Affiliation(s)
- Carolin Sommer-Trembo
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- M Emília Santos
  - Department of Zoology, University of Cambridge, Cambridge, UK
- Bethan Clark
  - Department of Zoology, University of Cambridge, Cambridge, UK
- Marco Werner
  - Leibniz-Institute for Polymer Research Dresden, Dresden, Germany
- Antoine Fages
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Simon Hornung
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Fabrizia Ronco
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
  - Natural History Museum, University of Oslo, Oslo, Norway
- Chantal Oliver
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Cody Garcia
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Patrick Tschopp
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Milan Malinsky
  - Department of Biology, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
- Walter Salzburger
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
|
26
|
Guyon L, Guez J, Toupance B, Heyer E, Chaix R. Patrilineal segmentary systems provide a peaceful explanation for the post-Neolithic Y-chromosome bottleneck. Nat Commun 2024; 15:3243. [PMID: 38658560 PMCID: PMC11043392 DOI: 10.1038/s41467-024-47618-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 04/08/2024] [Indexed: 04/26/2024] Open
Abstract
Studies have found a pronounced decline in male effective population sizes worldwide around 3000-5000 years ago. This bottleneck was not observed for female effective population sizes, which continued to increase over time. Until now, this remarkable genetic pattern was interpreted as the result of an ancient structuring of human populations into patrilineal groups (gathering closely related males) violently competing with each other. In this scenario, violence is responsible for the repeated extinctions of patrilineal groups, leading to a significant reduction in male effective population size. Here, we propose an alternative hypothesis by modelling a segmentary patrilineal system based on anthropological literature. We show that variance in reproductive success between patrilineal groups, combined with lineal fission (i.e., the splitting of a group into two new groups of patrilineally related individuals), can lead to a substantial reduction in the male effective population size without resorting to the violence hypothesis. Thus, a peaceful explanation involving ancient changes in social structures, linked to global changes in subsistence systems, may be sufficient to explain the reported decline in Y-chromosome diversity.
Affiliation(s)
- Léa Guyon
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
- Jérémy Guez
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
  - Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, 91400, France
- Bruno Toupance
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
  - Université Paris Cité, Eco-anthropologie, Paris, F-75006, France
- Evelyne Heyer
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
- Raphaëlle Chaix
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
|
27
|
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. bioRxiv 2024:2023.11.03.565466. [PMID: 37961279 PMCID: PMC10635123 DOI: 10.1101/2023.11.03.565466] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/15/2023]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, UK
  - Department of Statistics, University of Oxford, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, UK
  - Department of Statistics, University of Warwick, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Anthony W. Wohns
  - Broad Institute of MIT and Harvard, Cambridge, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
|
28
|
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. bioRxiv 2024:2024.04.07.588318. [PMID: 38645049 PMCID: PMC11030438 DOI: 10.1101/2024.04.07.588318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, fixation probabilities, allele frequencies, and linkage disequilibrium) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward, thus it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q. In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling effect's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
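The rescaling rule described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the `SimParams` container, the `rescale` helper, and the numeric values are hypothetical and do not come from the preprint or from any simulator's API. The sketch divides the population size by Q and multiplies the mutation rate, recombination rate, and selection coefficient by Q, so that products such as N·μ are unchanged.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical per-generation parameters of a forward simulation."""
    population_size: int
    mutation_rate: float
    recombination_rate: float
    selection_coefficient: float


def rescale(params: SimParams, q: float) -> SimParams:
    """Shrink the population by q and scale the rates up by the same q."""
    if q < 1:
        raise ValueError("rescaling factor q must be >= 1")
    return SimParams(
        population_size=max(1, round(params.population_size / q)),
        mutation_rate=params.mutation_rate * q,
        recombination_rate=params.recombination_rate * q,
        selection_coefficient=params.selection_coefficient * q,
    )


# Illustrative values only.
original = SimParams(10_000, 1e-8, 1e-8, 0.001)
scaled = rescale(original, q=10)
print(scaled)
```

Keeping the products constant is the motivation for the rule; the abstract's point is that measured outcomes can still be biased, so a check on small pilot simulations remains advisable.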
Affiliation(s)
- Amjad Dabi
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
|
29
|
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024] Open
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
|
30
|
Rivas-González I, Tung J. A multi-million-year natural experiment: Comparative genomics on a massive scale and its implications for human health. Evol Med Public Health 2024; 12:67-70. [PMID: 38601345 PMCID: PMC11005778 DOI: 10.1093/emph/eoae006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 03/18/2024] [Indexed: 04/12/2024] Open
Abstract
Improving the diversity and quality of genome assemblies for non-human mammals has been a long-standing goal of comparative genomics. The last year saw substantial progress towards this goal, including the release of genome alignments for 240 mammals and nearly half the primate order. These resources have increased our ability to identify evolutionarily constrained regions of the genome, and together strongly support the importance of these regions to biomedically relevant trait variation in humans. They also provide new strategies for identifying the genetic basis of changes unique to individual lineages, illustrating the value of evolutionary comparative approaches for understanding human health.
Affiliation(s)
- Iker Rivas-González
  - Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Jenny Tung
  - Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
  - Department of Evolutionary Anthropology, Duke University, Durham, NC, USA
  - Department of Biology, Duke University, Durham, NC, USA
  - Faculty of Life Sciences, Institute of Biology, Leipzig University, Leipzig, Germany
|
31
|
Riley R, Mathieson I, Mathieson S. Interpreting generative adversarial networks to infer natural selection from genetic data. Genetics 2024; 226:iyae024. [PMID: 38386895 PMCID: PMC10990424 DOI: 10.1093/genetics/iyae024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 01/15/2024] [Accepted: 01/19/2024] [Indexed: 02/24/2024] Open
Abstract
Understanding natural selection and other forms of non-neutrality is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection and other local evolutionary processes that requires relatively few selection simulations during training. We build upon a generative adversarial network trained to simulate realistic neutral data. This consists of a generator (fitted demographic model), and a discriminator (convolutional neural network) that predicts whether a genomic region is real or fake. As the generator can only generate data under neutral demographic processes, regions of real data that the discriminator recognizes as having a high probability of being "real" do not fit the neutral demographic model and are therefore candidates for targets of selection. To incentivize identification of a specific mode of selection, we fine-tune the discriminator with a small number of custom non-neutral simulations. We show that this approach has high power to detect various forms of selection in simulations, and that it finds regions under positive selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics.
Affiliation(s)
- Rebecca Riley
  - Department of Computer Science, Haverford College, Haverford, PA 19041, USA
- Iain Mathieson
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Sara Mathieson
  - Department of Computer Science, Haverford College, Haverford, PA 19041, USA
|
32
|
Johnson OL, Tobler R, Schmidt JM, Huber CD. Population genetic simulation: Benchmarking frameworks for non-standard models of natural selection. Mol Ecol Resour 2024; 24:e13930. [PMID: 38247258 DOI: 10.1111/1755-0998.13930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 12/21/2023] [Accepted: 01/09/2024] [Indexed: 01/23/2024]
Abstract
Population genetic simulation has emerged as a common tool for investigating increasingly complex evolutionary and demographic models. Software capable of handling high-level model complexity has recently been developed, and the advancement of tree sequence recording now allows simulations to merge the efficiency and genealogical insight of coalescent simulations with the flexibility of forward simulations. However, frameworks utilizing these features have not yet been compared and benchmarked. Here, we evaluate various simulation workflows using the coalescent simulator msprime and the forward simulator SLiM, to assess resource efficiency and determine an optimal simulation framework. Three aspects were evaluated: (1) the burn-in, to establish an equilibrium level of neutral diversity in the population; (2) the forward simulation, in which temporally fluctuating selection is acting; and (3) the final computation of summary statistics. We provide typical memory and computation time requirements for each step. We find that the fastest framework, a combination of coalescent and forward simulation with tree sequence recording, increases simulation speed by over twenty times compared to classical forward simulations without tree sequence recording, although it does require six times more memory. Overall, using efficient simulation workflows can lead to a substantial improvement when modelling complex evolutionary scenarios, although the optimal framework ultimately depends on the available computational resources.
Affiliation(s)
- Olivia L Johnson
  - School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Raymond Tobler
  - Evolution of Cultural Diversity Initiative, The Australian National University, Canberra, Australian Capital Territory, Australia
- Joshua M Schmidt
  - Department of Ophthalmology, College of Medicine and Public Health, Flinders University, Adelaide, South Australia, Australia
- Christian D Huber
  - School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
  - Department of Biology, Pennsylvania State University, University Park, Pennsylvania, USA
|
33
|
Clark MI, Fitzpatrick SW, Bradburd GS. Pitfalls and windfalls of detecting demographic declines using population genetics in long-lived species. bioRxiv 2024:2024.03.27.586886. [PMID: 38585961 PMCID: PMC10996660 DOI: 10.1101/2024.03.27.586886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Detecting recent demographic changes is a crucial component of species conservation and management, as many natural populations face declines due to anthropogenic habitat alteration and climate change. Genetic methods allow researchers to detect changes in effective population size (Ne) from sampling at a single timepoint. However, in species with long lifespans, there is a lag between the start of a decline in a population and the resulting decrease in genetic diversity. This lag slows the rate at which diversity is lost, and therefore makes it difficult to detect recent declines using genetic data. However, the genomes of old individuals can provide a window into the past, and can be compared to those of younger individuals, a contrast that may help reveal recent demographic declines. To test whether comparing the genomes of young and old individuals can help infer recent demographic bottlenecks, we use forward-time, individual-based simulations with varying mean individual lifespans and extents of generational overlap. We find that age information can be used to aid in the detection of demographic declines when the decline has been severe. When average lifespan is long, comparing young and old individuals from a single timepoint has greater power to detect a recent (within the last 50 years) bottleneck event than comparing individuals sampled at different points in time. Our results demonstrate how longevity and generational overlap can be both a hindrance and a boon to detecting recent demographic declines from population genomic data.
|
34
|
Guardado M, Perez C, Jackson S, Magaña J, Campana S, Samperio E, Rojas BC, Hernandez S, Syas K, Hernandez R, Zavala EI, Rohlfs R. py_ped_sim - A flexible forward genetic simulator for complex family pedigree analysis. bioRxiv 2024:2024.03.25.586501. [PMID: 38585824 PMCID: PMC10996500 DOI: 10.1101/2024.03.25.586501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Background: Large-scale family pedigrees are commonly used across medical, evolutionary, and forensic genetics. These pedigrees are tools for identifying genetic disorders, tracking evolutionary patterns, and establishing familial relationships via forensic genetic identification. However, there is a lack of software to accurately simulate different pedigree structures along with genomes corresponding to those individuals in a family pedigree. This limits simulation-based evaluations of methods that use pedigrees. Results: We have developed a Python command-line-based tool called py_ped_sim that facilitates the simulation of pedigree structures and the genomes of individuals in a pedigree. py_ped_sim represents pedigrees as directed acyclic graphs, enabling conversion between standard pedigree formats and integration with the forward population genetic simulator, SLiM. Notably, py_ped_sim allows the simulation of varying numbers of offspring for a set of parents, with the capacity to shift the distribution of sibship sizes over generations. We additionally add simulations for events of misattributed paternity, which offers a way to simulate half-sibling relationships. We validated the accuracy of our software by simulating genomes onto diverse family pedigree structures, showing that the estimated kinship coefficients closely approximated expected values. Conclusions: py_ped_sim is a user-friendly and open-source solution for simulating pedigree structures and conducting pedigree genome simulations. It empowers medical, forensic, and evolutionary genetics researchers to gain deeper insights into the dynamics of genetic inheritance and relatedness within families.
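To make the "expected values" of kinship mentioned in this abstract concrete, the following sketch computes expected kinship coefficients on a small pedigree stored as a directed acyclic graph (child mapped to its two parents). It uses the standard recursive definition of the kinship coefficient and is not py_ped_sim's code or interface; the `PEDIGREE` dictionary, the individual labels, and the helper names are all hypothetical.

```python
from functools import lru_cache

# Hypothetical pedigree as a DAG: individual -> (father, mother).
# Founders have unknown parents, written as (None, None).
PEDIGREE = {
    "gf": (None, None),
    "gm": (None, None),
    "m2": (None, None),
    "p1": ("gf", "gm"),
    "p2": ("gf", "gm"),  # full sibling of p1
    "h": ("gf", "m2"),   # half sibling of p1 (shared father only)
}


@lru_cache(maxsize=None)
def depth(ind):
    """Generation depth: founders are 0, children are deeper than parents."""
    if ind is None:
        return -1
    father, mother = PEDIGREE[ind]
    return 1 + max(depth(father), depth(mother))


@lru_cache(maxsize=None)
def kinship(a, b):
    """Expected kinship coefficient between a and b (standard recursion)."""
    if a is None or b is None:
        return 0.0  # unknown parents are treated as unrelated
    if a == b:
        father, mother = PEDIGREE[a]
        return 0.5 * (1.0 + kinship(father, mother))
    # Recurse through the deeper individual, which cannot be an
    # ancestor of the other one.
    if depth(a) < depth(b):
        a, b = b, a
    father, mother = PEDIGREE[a]
    return 0.5 * (kinship(father, b) + kinship(mother, b))


print(kinship("p1", "p2"), kinship("p1", "h"), kinship("gf", "p1"))
```

Values such as these (0.25 for full siblings and parent-offspring pairs, 0.125 for half siblings) are the kind of theoretical targets against which kinship estimated from simulated genomes could be compared.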
Collapse
Affiliation(s)
- Miguel Guardado
- San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
- University of California San Francisco, Biological and Medical Informatics Graduate Program. San Francisco CA, 94158
- Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
- University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
| | - Cynthia Perez
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Shalom Jackson
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Joaquín Magaña
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Sthen Campana
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Emily Samperio
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | | | - Selena Hernandez
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Kaela Syas
- San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
| | - Ryan Hernandez
- Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
| | - Elena I. Zavala
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA, 94720, USA
| | - Rori Rohlfs
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
|
35
|
Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, Silva JC, Waters NC, O'Connor TD, Takala-Harrison S. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun 2024; 15:2499. [PMID: 38509066 PMCID: PMC10954658 DOI: 10.1038/s41467-024-46659-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 02/28/2024] [Indexed: 03/22/2024] Open
Abstract
Malaria genomic surveillance often estimates parasite genetic relatedness using metrics such as Identity-By-Descent (IBD), yet strong positive selection stemming from antimalarial drug resistance or other interventions may bias IBD-based estimates. In this study, we use simulations, a true IBD inference algorithm, and empirical data sets from different malaria transmission settings to investigate the extent of this bias and explore potential correction strategies. We analyze whole genome sequence data generated from 640 new and 3089 publicly available Plasmodium falciparum clinical isolates. We demonstrate that positive selection distorts IBD distributions, leading to underestimated effective population size and blurred population structure. Additionally, we discover that the removal of IBD peak regions partially restores the accuracy of IBD-based inferences, with this effect contingent on the population's background genetic relatedness and extent of inbreeding. Consequently, we advocate for selection correction for parasite populations undergoing strong, recent positive selection, particularly in high malaria transmission settings.
Affiliation(s)
- Bing Guo
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Victor Borda
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Roland Laboulaye
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Michele D Spring
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Mariusz Wojnarski
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Brian A Vesely
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Joana C Silva
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Global Health and Tropical Medicine (GHTM), Instituto de Higiene e Medicina Tropical (IHMT), Universidade NOVA de Lisboa (NOVA), Lisbon, Portugal
| | - Norman C Waters
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Timothy D O'Connor
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
| | - Shannon Takala-Harrison
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA.
|
36
|
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. bioRxiv 2024:2024.03.15.585300. [PMID: 38559192 PMCID: PMC10980082 DOI: 10.1101/2024.03.15.585300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and source sink dynamics of dispersal should shape genetic variation over space, however there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity by descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic based methods like ours complement other, direct methods for estimating past and present demography, and we believe will serve as valuable tools for applications in conservation, ecology, and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Affiliation(s)
- Chris C. R. Smith
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Gilia Patterson
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Peter L. Ralph
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Andrew D. Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
|
37
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. bioRxiv 2024:2024.03.13.584790. [PMID: 38559118 PMCID: PMC10980058 DOI: 10.1101/2024.03.13.584790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Summary Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact daiki.tagami@hertford.ox.ac.uk.
Affiliation(s)
- Daiki Tagami
- Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, United Kingdom
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Gertjan Bisschop
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
|
38
|
Huang Z, Kelleher J, Chan YB, Balding DJ. Estimating evolutionary and demographic parameters via ARG-derived IBD. bioRxiv 2024:2024.03.07.583855. [PMID: 38559261 PMCID: PMC10979897 DOI: 10.1101/2024.03.07.583855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4,000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
| | - Jerome Kelleher
- Oxford Big Data Institute, University of Oxford, United Kingdom
| | - Yao-ban Chan
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
| | - David J. Balding
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
|
39
|
Kent TV, Schrider DR, Matute DR. Demographic history and the efficacy of selection in the globally invasive mosquito Aedes aegypti. bioRxiv 2024:2024.03.07.584008. [PMID: 38559089 PMCID: PMC10979846 DOI: 10.1101/2024.03.07.584008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Aedes aegypti is the main vector species of yellow fever, dengue, Zika and chikungunya. The species is originally from Africa but has experienced a spectacular expansion in its geographic range to a large swath of the world, the demographic effects of which have remained largely understudied. In this report, we examine whole-genome sequences from 6 countries in Africa, North America, and South America to investigate the demographic history of the spread of Ae. aegypti into the Americas and its impact on genomic diversity. In the Americas, we observe patterns of strong population structure consistent with relatively low (but probably non-zero) levels of gene flow but occasional long-range dispersal and/or recolonization events. We also find evidence that the colonization of the Americas has resulted in introduction bottlenecks. However, while each sampling location shows evidence of a past population contraction and subsequent recovery, our results suggest that the bottlenecks in America have led to a reduction in genetic diversity of only ~35% relative to African populations, and the American samples have retained high levels of genetic diversity (expected heterozygosity of ~0.02 at synonymous sites) and have experienced only a minor reduction in the efficacy of selection. These results evoke the image of an invasive species that has expanded its range with remarkable genetic resilience in the face of strong eradication pressure.
Affiliation(s)
- Tyler V. Kent
- Department of Biology, College of Arts and Sciences, University of North Carolina, Chapel Hill, NC, USA
- Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC, USA
| | - Daniel R. Schrider
- Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC, USA
| | - Daniel R. Matute
- Department of Biology, College of Arts and Sciences, University of North Carolina, Chapel Hill, NC, USA
|
40
|
Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv 2024:2024.02.10.579721. [PMID: 38496530 PMCID: PMC10942266 DOI: 10.1101/2024.02.10.579721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, namely including both the genetic relatedness matrix (GRM) and its leading eigenvectors (which correspond to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses.
As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
|
41
|
Simon A, Coop G. The contribution of gene flow, selection, and genetic drift to five thousand years of human allele frequency change. Proc Natl Acad Sci U S A 2024; 121:e2312377121. [PMID: 38363870 PMCID: PMC10907250 DOI: 10.1073/pnas.2312377121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 01/09/2024] [Indexed: 02/18/2024] Open
Abstract
Genomic time series from experimental evolution studies and ancient DNA datasets offer us a chance to directly observe the interplay of various evolutionary forces. We show how the genome-wide variance in allele frequency change between two time points can be decomposed into the contributions of gene flow, genetic drift, and linked selection. In closed populations, the contribution of linked selection is identifiable because it creates covariances between time intervals, and genetic drift does not. However, repeated gene flow between populations can also produce directionality in allele frequency change, creating covariances. We show how to accurately separate the fraction of variance in allele frequency change due to admixture and linked selection in a population receiving gene flow. We use two human ancient DNA datasets, spanning around 5,000 y, as time transects to quantify the contributions to the genome-wide variance in allele frequency change. We find that a large fraction of genome-wide change is due to gene flow. In both cases, after correcting for known major gene flow events, we do not observe a signal of genome-wide linked selection. Thus despite the known role of selection in shaping long-term polymorphism levels, and an increasing number of examples of strong selection on single loci and polygenic scores from ancient DNA, it appears to be gene flow and drift, and not selection, that are the main determinants of recent genome-wide allele frequency change. Our approach should be applicable to the growing number of contemporary and ancient temporal population genomics datasets.
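The decomposition described above rests on a generic identity: the variance of the total allele frequency change equals the sum of per-interval variances plus twice the sum of covariances between intervals, Var(p_T − p_0) = Σ_t Var(Δp_t) + 2 Σ_{s<t} Cov(Δp_s, Δp_t). The sketch below checks this identity on a toy Wright-Fisher simulation with drift only. It is not the authors' estimator, which additionally accounts for gene flow; the population size, number of loci, and function names are assumptions chosen for illustration.

```python
import random


def cov(x, y):
    """Unbiased sample covariance of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def simulate_drift(num_loci=500, pop_size=100, num_steps=3, seed=1):
    """Allele frequency trajectories at independent loci under pure drift."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(num_loci):
        p = rng.uniform(0.2, 0.8)
        traj = [p]
        for _ in range(num_steps):
            # Binomial sampling of 2N gene copies for the next time point.
            copies = sum(rng.random() < p for _ in range(2 * pop_size))
            p = copies / (2 * pop_size)
            traj.append(p)
        trajectories.append(traj)
    return trajectories


def decompose(trajectories):
    """Return total variance, per-interval variances, and pairwise covariances."""
    num_steps = len(trajectories[0]) - 1
    # deltas[t] holds the change over interval t at every locus.
    deltas = [[traj[t + 1] - traj[t] for traj in trajectories]
              for t in range(num_steps)]
    total = [traj[-1] - traj[0] for traj in trajectories]
    variances = [cov(d, d) for d in deltas]
    covariances = {(s, t): cov(deltas[s], deltas[t])
                   for s in range(num_steps) for t in range(s + 1, num_steps)}
    return cov(total, total), variances, covariances


if __name__ == "__main__":
    total_var, variances, covariances = decompose(simulate_drift())
    print("total variance:       ", total_var)
    print("sum of variances:     ", sum(variances))
    print("2 * sum of covariances:", 2 * sum(covariances.values()))
```

Under drift alone the covariance terms are expected to be close to zero, which is why sustained non-zero covariances between intervals are informative about linked selection or, as the abstract notes, repeated gene flow.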
Affiliation(s)
- Alexis Simon
- Center for Population Biology, University of California, Davis, CA 95616
- Department of Evolution and Ecology, University of California, Davis, CA 95616
| | - Graham Coop
- Center for Population Biology, University of California, Davis, CA 95616
- Department of Evolution and Ecology, University of California, Davis, CA 95616
|
42
|
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally efficient demographic history inference from allele frequencies with supervised machine learning. bioRxiv 2024:2023.05.24.542158. [PMID: 38405827 PMCID: PMC10888863 DOI: 10.1101/2023.05.24.542158] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 02/27/2024]
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we developed donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future input data AFS. We demonstrated that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
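For readers unfamiliar with the input these methods consume, the sketch below computes an unfolded allele frequency spectrum (AFS) from a small matrix of haplotypes. It does not use dadi or donni and makes no claim about their interfaces; the function name and the 0/1 encoding (0 = ancestral, 1 = derived) are assumptions for illustration.

```python
from collections import Counter


def allele_frequency_spectrum(haplotypes):
    """Unfolded AFS for a list of equal-length 0/1 haplotypes.

    Entry k of the result is the number of sites at which exactly k of the
    n sampled chromosomes carry the derived allele (k = 0..n).
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    derived_counts = Counter(
        sum(hap[site] for hap in haplotypes) for site in range(num_sites)
    )
    return [derived_counts.get(k, 0) for k in range(n + 1)]


if __name__ == "__main__":
    # Four sampled chromosomes (rows) typed at four sites (columns).
    sample = [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 0],
    ]
    print(allele_frequency_spectrum(sample))
```

A spectrum of this kind, observed or simulated under a demographic model, is the summary from which model parameters are inferred.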
Affiliation(s)
- Linh N. Tran
- Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Connie K. Sun
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Travis J. Struck
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Mathews Sajan
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Ryan N. Gutenkunst
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
|
43
|
Nunez JCB, Lenhart BA, Bangerter A, Murray CS, Mazzeo GR, Yu Y, Nystrom TL, Tern C, Erickson PA, Bergland AO. A cosmopolitan inversion facilitates seasonal adaptation in overwintering Drosophila. Genetics 2024; 226:iyad207. [PMID: 38051996 PMCID: PMC10847723 DOI: 10.1093/genetics/iyad207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Accepted: 11/28/2023] [Indexed: 12/07/2023] Open
Abstract
Fluctuations in the strength and direction of natural selection through time are a ubiquitous feature of life on Earth. One evolutionary outcome of such fluctuations is adaptive tracking, wherein populations rapidly adapt from standing genetic variation. In certain circumstances, adaptive tracking can lead to the long-term maintenance of functional polymorphism despite allele frequency change due to selection. Although adaptive tracking is likely a common process, we still have a limited understanding of aspects of its genetic architecture and its strength relative to other evolutionary forces such as drift. Drosophila melanogaster living in temperate regions evolve to track seasonal fluctuations and are an excellent system to tackle these gaps in knowledge. By sequencing orchard populations collected across multiple years, we characterized the genomic signal of seasonal demography and identified that the cosmopolitan inversion In(2L)t facilitates seasonal adaptive tracking and shows molecular footprints of selection. A meta-analysis of phenotypic studies shows that seasonal loci within In(2L)t are associated with behavior, life history, physiology, and morphological traits. We identify candidate loci and experimentally link them to phenotype. Our work contributes to our general understanding of fluctuating selection and highlights the evolutionary outcome and dynamics of contemporary selection on inversions.
Affiliation(s)
- Joaquin C B Nunez
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Department of Biology, University of Vermont, 109 Carrigan Drive, Burlington, VT 05405, USA
| | - Benedict A Lenhart
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Alyssa Bangerter
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Connor S Murray
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Giovanni R Mazzeo
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Yang Yu
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Taylor L Nystrom
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Courtney Tern
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
| | - Priscilla A Erickson
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Department of Biology, University of Richmond, 138 UR Drive, Richmond, VA 23173, USA
| | - Alan O Bergland
- Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
|
44
|
Rivas-González I, Schierup MH, Wakeley J, Hobolth A. TRAILS: Tree reconstruction of ancestry using incomplete lineage sorting. PLoS Genet 2024; 20:e1010836. [PMID: 38330138 PMCID: PMC10880969 DOI: 10.1371/journal.pgen.1010836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 02/21/2024] [Accepted: 01/22/2024] [Indexed: 02/10/2024] Open
Abstract
Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.
Affiliation(s)
| | - Mikkel H. Schierup
- Bioinformatics Research Center (BiRC), Aarhus University, Aarhus, Denmark
| | - John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Massachusetts, United States of America
| | - Asger Hobolth
- Department of Mathematics, Aarhus University, Aarhus, Denmark
|
45
|
van der Valk T, Jensen A, Caillaud D, Guschanski K. Comparative genomic analyses provide new insights into evolutionary history and conservation genomics of gorillas. BMC Ecol Evol 2024; 24:14. [PMID: 38273244 PMCID: PMC10811819 DOI: 10.1186/s12862-023-02195-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 12/22/2023] [Indexed: 01/27/2024] Open
Abstract
Genome sequencing is a powerful tool to understand species evolutionary history, uncover genes under selection, which could be informative of local adaptation, and infer measures of genetic diversity, inbreeding and mutational load that could be used to inform conservation efforts. Gorillas, critically endangered primates, have received considerable attention and with the recently sequenced Bwindi mountain gorilla population, genomic data is now available from all gorilla subspecies and both mountain gorilla populations. Here, we reanalysed this rich dataset with a focus on evolutionary history, local adaptation and genomic parameters relevant for conservation. We estimate a recent split between western and eastern gorillas of 150,000-180,000 years ago, with gene flow around 20,000 years ago, primarily between the Cross River and Grauer's gorilla subspecies. This gene flow event likely obscures evolutionary relationships within eastern gorillas: after excluding putatively introgressed genomic regions, we uncover a sister relationship between Virunga mountain gorillas and Grauer's gorillas to the exclusion of Bwindi mountain gorillas. This makes mountain gorillas paraphyletic. Eastern gorillas are less genetically diverse and more inbred than western gorillas, yet we detected lower genetic load in the eastern species. Analyses of indels fit remarkably well with differences in genetic diversity across gorilla taxa as recovered with nucleotide diversity measures. We also identified genes under selection and unique gene variants specific for each gorilla subspecies, encoding, among others, traits involved in immunity, diet, muscular development, hair morphology and behavior. The presence of this functional variation suggests that the subspecies may be locally adapted. 
In conclusion, using extensive genomic resources we provide a comprehensive overview of gorilla genomic diversity, including a so-far understudied Bwindi mountain gorilla population, identify putative genes involved in local adaptation, and detect population-specific gene flow across gorilla species.
Affiliation(s)
- Tom van der Valk
- Centre for Palaeogenetics, Stockholm, Sweden.
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden.
- SciLifeLab, Stockholm, Sweden.
- Department of Zoology, Stockholm University, Stockholm, Sweden.
| | - Axel Jensen
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala, Sweden
| | - Damien Caillaud
- Department of Anthropology, University of CA - Davis, Davis, California, USA
| | - Katerina Guschanski
- SciLifeLab, Stockholm, Sweden
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala, Sweden
- Institute of Ecology and Evolution, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
|
46
|
Simon A, Coop G. The contribution of gene flow, selection, and genetic drift to five thousand years of human allele frequency change. bioRxiv 2024:2023.07.11.548607. [PMID: 37503227 PMCID: PMC10370008 DOI: 10.1101/2023.07.11.548607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Genomic time series from experimental evolution studies and ancient DNA datasets offer us a chance to directly observe the interplay of various evolutionary forces. We show how the genome-wide variance in allele frequency change between two time points can be decomposed into the contributions of gene flow, genetic drift, and linked selection. In closed populations, the contribution of linked selection is identifiable because it creates covariances between time intervals, and genetic drift does not. However, repeated gene flow between populations can also produce directionality in allele frequency change, creating covariances. We show how to accurately separate the fraction of variance in allele frequency change due to admixture and linked selection in a population receiving gene flow. We use two human ancient DNA datasets, spanning around 5,000 years, as time transects to quantify the contributions to the genome-wide variance in allele frequency change. We find that a large fraction of genome-wide change is due to gene flow. In both cases, after correcting for known major gene flow events, we do not observe a signal of genome-wide linked selection. Thus despite the known role of selection in shaping long-term polymorphism levels, and an increasing number of examples of strong selection on single loci and polygenic scores from ancient DNA, it appears to be gene flow and drift, and not selection, that are the main determinants of recent genome-wide allele frequency change. Our approach should be applicable to the growing number of contemporary and ancient temporal population genomics datasets.
Affiliation(s)
- Alexis Simon
  - Center for Population Biology, University of California, Davis, CA 95616
  - Department of Evolution and Ecology, University of California, Davis, CA 95616
- Graham Coop
  - Center for Population Biology, University of California, Davis, CA 95616
  - Department of Evolution and Ecology, University of California, Davis, CA 95616
47
Zhang Y, Zhang H, Wu Y. A general approach for inferring the ancestry of recent ancestors of an admixed individual. Proc Natl Acad Sci U S A 2024; 121:e2316242120. [PMID: 38165936 PMCID: PMC10786287 DOI: 10.1073/pnas.2316242120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 11/27/2023] [Indexed: 01/04/2024] Open
Abstract
The genome of an individual from an admixed population consists of segments originated from different ancestral populations. Most existing ancestry inference approaches focus on calling these segments for the extant individual. In this paper, we present a general ancestry inference approach for inferring recent ancestors from an extant genome. Given the genome of an individual from a recently admixed population, our method can estimate the proportions of the genomes of the recent ancestors of this individual that originated from some ancestral populations. The key step of our method is the inference of ancestors (called founders) right after the formation of an admixed population. The inferred founders can then be used to infer the ancestry of recent ancestors of an extant individual. Our method is implemented in a computer program called PedMix2. To the best of our knowledge, there is no existing method that can practically infer ancestors beyond grandparents from an extant individual's genome. Results on both simulated and real data show that PedMix2 performs well in ancestry inference.
Affiliation(s)
- Yiming Zhang
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
- Haotian Zhang
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
- Yufeng Wu
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
48
Stankowski S, Zagrodzka ZB, Garlovsky MD, Pal A, Shipilina D, Castillo DG, Lifchitz H, Le Moan A, Leder E, Reeve J, Johannesson K, Westram AM, Butlin RK. The genetic basis of a recent transition to live-bearing in marine snails. Science 2024; 383:114-119. [PMID: 38175895 DOI: 10.1126/science.adi2982] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/25/2023] [Accepted: 10/25/2023] [Indexed: 01/06/2024]
Abstract
Key innovations are fundamental to biological diversification, but their genetic basis is poorly understood. A recent transition from egg-laying to live-bearing in marine snails (Littorina spp.) provides the opportunity to study the genetic architecture of an innovation that has evolved repeatedly across animals. Individuals do not cluster by reproductive mode in a genome-wide phylogeny, but local genealogical analysis revealed numerous small genomic regions where all live-bearers carry the same core haplotype. Candidate regions show evidence for live-bearer-specific positive selection and are enriched for genes that are differentially expressed between egg-laying and live-bearing reproductive systems. Ages of selective sweeps suggest that live-bearer-specific alleles accumulated over more than 200,000 generations. Our results suggest that new functions evolve through the recruitment of many alleles rather than in a single evolutionary step.
Affiliation(s)
- Sean Stankowski
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Department of Ecology and Evolution, University of Sussex, Brighton BN1 9RH, UK
- Zuzanna B Zagrodzka
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
- Martin D Garlovsky
  - Department of Applied Zoology, Faculty of Biology, Technische Universität Dresden, 01069 Dresden, Germany
- Arka Pal
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
- Daria Shipilina
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Department of Ecology and Genetics, Program of Evolutionary Biology, Uppsala University, SE-752 36 Uppsala, Sweden
- Hila Lifchitz
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
- Alan Le Moan
  - CNRS and Sorbonne Université, Station Biologique de Roscoff, 29680 Roscoff, France
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Erica Leder
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
  - Natural History Museum, University of Oslo, 0562 Oslo, Norway
- James Reeve
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Kerstin Johannesson
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Anja M Westram
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Faculty of Biosciences and Aquaculture, Nord University, N-8049 Bodø, Norway
- Roger K Butlin
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
49
Oliva A, Kaphle A, Reguant R, Sng LMF, Twine NA, Malakar Y, Wickramarachchi A, Keller M, Ranbaduge T, Chan EKF, Breen J, Buckberry S, Guennewig B, Haas M, Brown A, Cowley MJ, Thorne N, Jain Y, Bauer DC. Future-proofing genomic data and consent management: a comprehensive review of technology innovations. Gigascience 2024; 13:giae021. [PMID: 38837943 PMCID: PMC11152178 DOI: 10.1093/gigascience/giae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 01/15/2024] [Accepted: 04/09/2024] [Indexed: 06/07/2024] Open
Abstract
Genomic information is increasingly used to inform medical treatments and manage future disease risks. However, any personal and societal gains must be carefully balanced against the risk to individuals contributing their genomic data. Expanding our understanding of actionable genomic insights requires researchers to access large global datasets to capture the complexity of genomic contribution to diseases. Similarly, clinicians need efficient access to a patient's genome as well as population-representative historical records for evidence-based decisions. Both researchers and clinicians hence rely on participants to consent to the use of their genomic data, which in turn requires trust in the professional and ethical handling of this information. Here, we review existing and emerging solutions for secure and effective genomic information management, including storage, encryption, consent, and authorization that are needed to build participant trust. We discuss recent innovations in cloud computing, quantum-computing-proof encryption, and self-sovereign identity. These innovations can augment key developments from within the genomics community, notably GA4GH Passports and the Crypt4GH file container standard. We also explore how decentralized storage as well as the digital consenting process can offer culturally acceptable processes to encourage data contributions from ethnic minorities. We conclude that the individual and their right for self-determination needs to be put at the center of any genomics framework, because only on an individual level can the received benefits be accurately balanced against the risk of exposing private information.
Affiliation(s)
- Adrien Oliva
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Anubhav Kaphle
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Roc Reguant
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Letitia M F Sng
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Natalie A Twine
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Yuwan Malakar
  - Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation, Brisbane, 41 Boggo Rd, Dutton Park QLD 4102, Australia
- Anuradha Wickramarachchi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Marcel Keller
  - Data61, Commonwealth Scientific and Industrial Research Organisation, Level 5/13 Garden St, Eveleigh NSW 2015, Australia
- Thilina Ranbaduge
  - Data61, Commonwealth Scientific and Industrial Research Organisation, Building 101, Clunies Ross St, Black Mountain, Canberra, ACT 2601, Australia
- Eva K F Chan
  - NSW Health Pathology, Sydney, 1 Reserve Road, St Leonards NSW 2065, Australia
- James Breen
  - Telethon Kids Institute, Perth, WA 6009, Australia
  - National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Sam Buckberry
  - Telethon Kids Institute, Perth, WA 6009, Australia
  - National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Boris Guennewig
  - Sydney Medical School, Brain and Mind Centre, The University of Sydney, Sydney, 94 Mallett St, Camperdown NSW 2050, Australia
- Matilda Haas
  - Australian Genomics, Parkville, VIC 3052, Australia
  - Murdoch Children’s Research Institute, Parkville, Victoria 3052, Australia
- Alex Brown
  - Telethon Kids Institute, Perth, WA 6009, Australia
  - National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Mark J Cowley
  - Children’s Cancer Institute, Lowy Cancer Research Centre, Level 4, Lowy Cancer Research Centre Corner Botany & High Streets UNSW Kensington Campus UNSW Sydney, Kensington NSW 2052, Australia
  - School of Clinical Medicine, UNSW Medicine & Health, Wallace Wurth Building (C27), Cnr High St & Botany St, UNSW Sydney, Kensington NSW 2052, Australia
- Natalie Thorne
  - University of Melbourne, Melbourne, Parkville VIC 3052, Australia
  - Melbourne Genomics Health Alliance, Melbourne 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
  - Walter and Eliza Hall Institute, Melbourne, 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
- Yatish Jain
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
- Denis C Bauer
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
  - Department of Biomedical Sciences, MQ Health General Practice - Macquarie University, Suite 305, Level 3/2 Technology Pl, Macquarie Park NSW 2109, Australia
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Gate 13, Kintore Avenue University of Adelaide, Adelaide SA 5000, Australia
50
Benham PM, Walsh J, Bowie RCK. Spatial variation in population genomic responses to over a century of anthropogenic change within a tidal marsh songbird. Glob Chang Biol 2024; 30:e17126. [PMID: 38273486 DOI: 10.1111/gcb.17126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 11/22/2023] [Accepted: 12/13/2023] [Indexed: 01/27/2024]
Abstract
Combating the current biodiversity crisis requires the accurate documentation of population responses to human-induced ecological change. However, our ability to pinpoint population responses to human activities is often limited to the analysis of populations studied well after the fact. Museum collections preserve a record of population responses to anthropogenic change that can provide critical baseline data on patterns of genetic diversity, connectivity, and population structure prior to the onset of human perturbation. Here, we leverage a spatially replicated time series of specimens to document population genomic responses to the destruction of nearly 90% of coastal habitats occupied by the Savannah sparrow (Passerculus sandwichensis) in California. We sequenced 219 sparrows collected from 1889 to 2017 across the state of California using an exome capture approach. Spatial-temporal analyses of genetic diversity found that the amount of habitat lost was not predictive of genetic diversity loss. Sparrow populations from southern California historically exhibited lower levels of genetic diversity and experienced the most significant temporal declines in genetic diversity. Despite experiencing the greatest levels of habitat loss, we found that genetic diversity in the San Francisco Bay area remained relatively high. This was potentially related to an observed increase in gene flow into the Bay Area from other populations. While gene flow may have minimized genetic diversity declines, we also found that immigration from inland freshwater-adapted populations into tidal marsh populations led to the erosion of divergence at loci associated with tidal marsh adaptation. Shifting patterns of gene flow through time in response to habitat loss may thus contribute to negative fitness consequences and outbreeding depression. 
Together, our results underscore the importance of tracing the genomic trajectories of multiple populations over time to address issues of fundamental conservation concern.
Affiliation(s)
- Phred M Benham
  - Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, California, USA
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA
- Jennifer Walsh
  - Fuller Evolutionary Biology Program, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
- Rauri C K Bowie
  - Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, California, USA
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA