1
Meisner J, Benros ME, Rasmussen S. Leveraging haplotype information in heritability estimation and polygenic prediction. Nat Commun 2025; 16:126. [PMID: 39747034 PMCID: PMC11695728 DOI: 10.1038/s41467-024-55477-3] [Received: 06/10/2024] [Accepted: 12/13/2024] [Indexed: 01/04/2025]
Abstract
Polygenic prediction has yet to make a major clinical breakthrough in precision medicine and psychiatry, where the application of polygenic risk scores is expected to improve clinical decision-making. Most widely used approaches for estimating polygenic risk scores are based on summary statistics from external large-scale genome-wide association studies, which rely on assumptions of matching data distributions. This may hinder the impact of polygenic risk scores in modern diverse populations due to small differences in genetic architectures. Reference-free estimators of polygenic scores are instead based on genomic best linear unbiased predictions and model the population of interest directly. We introduce a framework, named hapla, with a novel algorithm for clustering haplotypes in phased genotype data to estimate heritability and perform reference-free polygenic prediction in complex traits. We utilize inferred haplotype clusters to compute accurate heritability estimates and polygenic scores in a simulation study and the iPSYCH2012 case-cohort for depression disorders and schizophrenia. We demonstrate that our haplotype-based approach robustly outperforms standard genotype-based approaches, which can help pave the way for polygenic risk scores in the future of precision medicine and psychiatry.
Affiliation(s)
- Jonas Meisner
- Copenhagen Research Center for Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, Hellerup, Denmark
- Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
- Michael Eriksen Benros
- Copenhagen Research Center for Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, Hellerup, Denmark
- Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Simon Rasmussen
- Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
2
Abstract
Genomic data are becoming increasingly affordable and easy to collect, and new tools for their analysis are appearing rapidly. Conservation biologists are interested in using this information to assist in management and planning but are typically limited financially and by the lack of genomic resources available for non-model taxa. It is therefore important to be aware of the pitfalls as well as the benefits of applying genomic approaches. Here, we highlight recent methods aimed at standardizing population assessments of genetic variation, inbreeding, and forms of genetic load and methods that help identify past and ongoing patterns of genetic interchange between populations, including those subjected to recent disturbance. We emphasize challenges in applying some of these methods and the need for adequate bioinformatic support. We also consider the promises and challenges of applying genomic approaches to understand adaptive changes in natural populations to predict their future adaptive capacity.
Affiliation(s)
- Thomas L Schmidt
- School of BioSciences, Bio21 Institute, University of Melbourne, Parkville, Victoria, Australia
- Joshua A Thia
- School of BioSciences, Bio21 Institute, University of Melbourne, Parkville, Victoria, Australia
- Ary A Hoffmann
- School of BioSciences, Bio21 Institute, University of Melbourne, Parkville, Victoria, Australia
3
Huang X, Rymbekova A, Dolgova O, Lao O, Kuhlwilm M. Harnessing deep learning for population genetic inference. Nat Rev Genet 2024; 25:61-78. [PMID: 37666948 DOI: 10.1038/s41576-023-00636-3] [Accepted: 07/11/2023] [Indexed: 09/06/2023]
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Affiliation(s)
- Xin Huang
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Aigerim Rymbekova
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Olga Dolgova
- Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
- Oscar Lao
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain
- Martin Kuhlwilm
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
4
Mas-Sandoval A, Mathieson S, Fumagalli M. The genomic footprint of social stratification in admixing American populations. eLife 2023; 12:e84429. [PMID: 38038347 PMCID: PMC10776089 DOI: 10.7554/elife.84429] [Received: 10/24/2022] [Accepted: 11/22/2023] [Indexed: 12/02/2023]
Abstract
Cultural and socioeconomic differences stratify human societies and shape their genetic structure beyond the sole effect of geography. Despite mating being limited by sociocultural stratification, most demographic models in population genetics often assume random mating. Taking advantage of the correlation between sociocultural stratification and the proportion of genetic ancestry in admixed populations, we sought to infer the former process in the Americas. To this aim, we define a mating model where the individual proportions of the genome inherited from Native American, European, and sub-Saharan African ancestral populations constrain the mating probabilities through ancestry-related assortative mating and sex bias parameters. We simulate a wide range of admixture scenarios under this model. Then, we train a deep neural network and retrieve good performance in predicting mating parameters from genomic data. Our results show how population stratification, shaped by socially constructed racial and gender hierarchies, has constrained the admixture processes in the Americas since the European colonization and the subsequent Atlantic slave trade.
Affiliation(s)
- Alex Mas-Sandoval
- Department of Life Sciences, Silwood Park Campus, Imperial College London, London, United Kingdom
- Department of Statistical Sciences, University of Bologna, Bologna, Italy
- Sara Mathieson
- Department of Computer Science, Haverford College, Haverford, United States
- Matteo Fumagalli
- Department of Life Sciences, Silwood Park Campus, Imperial College London, London, United Kingdom
- School of Biological and Behavioural Sciences, Queen Mary University of London, London, United Kingdom
5
Nait Saada J, Tsangalidou Z, Stricker M, Palamara PF. Inference of Coalescence Times and Variant Ages Using Convolutional Neural Networks. Mol Biol Evol 2023; 40:msad211. [PMID: 37738175 PMCID: PMC10581698 DOI: 10.1093/molbev/msad211] [Received: 03/06/2023] [Revised: 09/11/2023] [Accepted: 09/18/2023] [Indexed: 09/24/2023]
Abstract
Accurate inference of the time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses. We developed a likelihood-free approach, called CoalNN, which uses a convolutional neural network to predict pairwise TMRCAs and allele ages from sequencing or SNP array data. CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. Across several simulated scenarios, CoalNN matched or outperformed the accuracy of model-based approaches for pairwise TMRCA and allele age prediction. We applied CoalNN to settings for which model-based approaches are under-developed and performed analyses to gain insights into the set of features it uses to perform TMRCA prediction. We next used CoalNN to analyze 2,504 samples from 26 populations in the 1000 Genomes Project data set, inferring the age of ∼80 million variants. We observed substantial variation across populations and for variants predicted to be pathogenic, reflecting heterogeneous demographic histories and the action of negative selection. We used CoalNN's predicted allele ages to construct genome-wide annotations capturing the signature of past negative selection. We performed LD-score regression analysis of heritability using summary association statistics from 63 independent complex traits and diseases (average N=314k), observing increased annotation-specific effects on heritability compared to a previous allele age annotation. These results highlight the effectiveness of using likelihood-free, simulation-trained models to infer properties of gene genealogies in large genomic data sets.
Affiliation(s)
- Pier Francesco Palamara
- Department of Statistics, University of Oxford, Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
6
Abstract
Following the widespread use of deep learning for genomics, deep generative modeling is also becoming a viable methodology for the broad field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the real characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction by mapping the data space to a latent space, as well as for prediction tasks via exploitation of this learned mapping or supervised/semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, we present conceptual applications along with notable examples in functional and evolutionary genomics, and we provide our perspective on potential challenges and future directions.
Affiliation(s)
- Burak Yelmen
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France
- Institute of Genomics, University of Tartu, Tartu, Estonia
- Flora Jay
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France
7
Mantes AD, Montserrat DM, Bustamante CD, Giró-i-Nieto X, Ioannidis AG. Neural ADMIXTURE for rapid genomic clustering. Nat Comput Sci 2023; 3:621-629. [PMID: 37600116 PMCID: PMC10438426 DOI: 10.1038/s43588-023-00482-7] [Received: 02/25/2022] [Accepted: 06/06/2023] [Indexed: 08/22/2023]
Abstract
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments, with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by calculating multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
Affiliation(s)
- Albert Dominguez Mantes
- Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
- Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Vaud, Switzerland
- Daniel Mas Montserrat
- Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
- Xavier Giró-i-Nieto
- Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Alexander G. Ioannidis
- Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
- Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, United States
8
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023]
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
- Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
- Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
9
Sanchez T, Bray EM, Jobic P, Guez J, Letournel AC, Charpiat G, Cury J, Jay F. dnadna: a deep learning framework for population genetics inference. Bioinformatics 2022; 39:6851140. [PMID: 36445000 PMCID: PMC9825738 DOI: 10.1093/bioinformatics/btac765] [Received: 11/17/2021] [Revised: 10/30/2022] [Accepted: 11/28/2022] [Indexed: 11/30/2022]
Abstract
MOTIVATION: We present dnadna, flexible Python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination and re-usability of neural networks designed for population genetic data. RESULTS: dnadna defines multiple user-friendly workflows. First, users can implement new architectures and tasks while benefiting from dnadna utility functions, training procedure and test environment, which saves time and decreases the likelihood of bugs. Second, the implemented networks can be re-optimized based on user-specified training sets and/or tasks. Newly implemented architectures and pre-trained networks are easily shareable with the community for further benchmarking or other applications. Finally, users can apply pre-trained networks in order to predict evolutionary history from alternative real or simulated genetic datasets, without requiring extensive knowledge in deep learning or coding in general. dnadna comes with a peer-reviewed, exchangeable neural network, allowing demographic inference from SNP data, that can be used directly or retrained to solve other tasks. Toy networks are also available to ease the exploration of the software, and we expect that the range of available architectures will keep expanding thanks to community contributions. AVAILABILITY AND IMPLEMENTATION: dnadna is a Python (≥3.7) package; its repository is available at gitlab.com/mlgenetics/dnadna and its associated documentation at mlgenetics.gitlab.io/dnadna/.
Affiliation(s)
- Pierre Jobic
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- ENS Paris-Saclay, 91190 Gif-sur-Yvette, France
- Jérémy Guez
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- UMR7206 Eco-Anthropologie, Muséum National d’Histoire Naturelle, CNRS, Université de Paris, 75016 Paris, France
- Anne-Catherine Letournel
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- Guillaume Charpiat
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- Jean Cury
- Corresponding author
- Flora Jay
- Corresponding author