1
Nguembang Fadja A, Riguzzi F, Bertorelle G, Trucchi E. Identification of natural selection in genomic data with deep convolutional neural network. BioData Min 2021; 14:51. [PMID: 34863217 PMCID: PMC8642854 DOI: 10.1186/s13040-021-00280-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Accepted: 10/25/2021] [Indexed: 11/10/2022] Open
Abstract
Background With the increase in the size of genomic datasets describing variability in populations, extracting relevant information becomes increasingly useful as well as complex. Recently, computational methodologies such as Supervised Machine Learning, and specifically Convolutional Neural Networks, have been proposed to make inferences on demographic and adaptive processes using genomic data. Although it has already been shown to be powerful and efficient in other fields of investigation, Supervised Machine Learning has yet to be explored fully enough to unfold its potential in evolutionary genomics. Results The paper proposes a method based on Supervised Machine Learning for classifying genomic data, represented as windows of genomic sequences from a sample of individuals belonging to the same population. A Convolutional Neural Network is used to test whether a genomic window shows the signature of natural selection. Training performed on simulated data shows that the proposed model can accurately predict neutral and selection processes on portions of genomes taken from real populations with almost 90% accuracy.
Affiliation(s)
- Arnaud Nguembang Fadja
  - Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy
- Fabrizio Riguzzi
  - Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy
- Giorgio Bertorelle
  - Dipartimento di Scienze della Vita e Biotecnologie, University of Ferrara, Via Luigi Borsari 46, Ferrara, I-44121, Italy
- Emiliano Trucchi
  - Dipartimento di Scienze della Vita e dell'Ambiente, Marche Polytechnic University, Via Brecce Bianche, Ancona, I-60131, Italy
2
Galimberti M, Leuenberger C, Wolf B, Szilágyi SM, Foll M, Wegmann D. Detecting Selection from Linked Sites Using an F-Model. Genetics 2020; 216:1205-1215. [PMID: 33067324 PMCID: PMC7768260 DOI: 10.1534/genetics.120.303780] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/17/2020] [Accepted: 10/03/2020] [Indexed: 11/18/2022] Open
Abstract
Allele frequencies vary across populations and loci, even in the presence of migration. While most differences may be due to genetic drift, divergent selection will further increase differentiation at some loci. Identifying those is key in studying local adaptation, but remains statistically challenging. A particularly elegant way to describe allele frequency differences among populations connected by migration is the F-model, which measures differences in allele frequencies by population-specific FST coefficients. This model readily accounts for multiple evolutionary forces by partitioning FST coefficients into locus- and population-specific components reflecting selection and drift, respectively. Here we present an extension of this model to linked loci by means of a hidden Markov model (HMM), which characterizes the effect of selection on linked markers through correlations in the locus-specific component along the genome. Using extensive simulations, we show that the statistical power of our method is up to twofold higher than that of previous implementations that assume sites to be independent. Finally, we find evidence of selection in the human genome by applying our method to data from the Human Genome Diversity Project (HGDP).
Affiliation(s)
- Marco Galimberti
  - Department of Biology, University of Fribourg, 1700, Switzerland
  - Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
- Beat Wolf
  - iCoSys, University of Applied Sciences Western Switzerland, Fribourg, 1700 Switzerland
- Sándor Miklós Szilágyi
  - Department of Informatics, University of Medicine, Pharmacy, Science and Technology of Târgu Mureş, Târgu Mureş, 540139, Romania
- Matthieu Foll
  - International Agency for Research on Cancer (IARC/WHO), Section of Genetics, 69372 Lyon, France
- Daniel Wegmann
  - Department of Biology, University of Fribourg, 1700, Switzerland
  - Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
3
Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 2019; 36:220-238. [PMID: 30517664 PMCID: PMC6367976 DOI: 10.1093/molbev/msy224] [Citation(s) in RCA: 98] [Impact Index Per Article: 19.6] [Indexed: 12/28/2022] Open
Abstract
Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.
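As an illustration of what "representing DNA sequence alignments as images" can mean in practice, the following is a minimal sketch, not the authors' actual encoding. It assumes a biallelic-style coding in which each row is one sampled sequence, each column is a segregating site, and a cell is 0 for the most common allele and 1 otherwise. The function name `alignment_to_matrix` is hypothetical.

```python
from collections import Counter


def alignment_to_matrix(sequences):
    """Encode an alignment as a 0/1 matrix over segregating sites.

    Rows are sequences, columns are polymorphic positions.
    0 = most common allele at that position, 1 = any other allele.
    This coding is an assumption made for illustration only.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")

    columns = []
    for pos in range(length):
        column = [s[pos] for s in sequences]
        counts = Counter(column)
        if len(counts) < 2:
            continue  # monomorphic site: carries no variation, skip it
        major = counts.most_common(1)[0][0]
        columns.append([0 if base == major else 1 for base in column])

    # Transpose so that each row corresponds to one input sequence.
    return [list(row) for row in zip(*columns)]
```

For the three sequences `"ACGT"`, `"ACGA"`, `"TCGA"`, only the first and last positions vary, so the result is the 3 x 2 matrix `[[0, 1], [0, 0], [1, 0]]`. A matrix of this kind could then be passed to an image-style classifier.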
Affiliation(s)
- Lex Flagel
  - Monsanto Company, Chesterfield, MO
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Yaniv Brandvain
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC
4
Schrider DR, Kern AD. Supervised Machine Learning for Population Genetics: A New Paradigm. Trends Genet 2018; 34:301-312. [PMID: 29331490 PMCID: PMC5905713 DOI: 10.1016/j.tig.2017.12.005] [Citation(s) in RCA: 210] [Impact Index Per Article: 35.0] [Received: 10/20/2017] [Revised: 11/29/2017] [Accepted: 12/08/2017] [Indexed: 01/21/2023]
Abstract
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
Affiliation(s)
- Daniel R Schrider
  - Department of Genetics, and Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ 08554, USA
- Andrew D Kern
  - Department of Genetics, and Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ 08554, USA
5
Sand A, Kristiansen M, Pedersen CNS, Mailund T. zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm. BMC Bioinformatics 2013; 14:339. [PMID: 24266924 PMCID: PMC4222747 DOI: 10.1186/1471-2105-14-339] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/18/2013] [Accepted: 11/14/2013] [Indexed: 11/10/2022] Open
Abstract
Background Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. Results We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library. Conclusions We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
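The complexity stated in the abstract (linear in sequence length, quadratic in the number of states) can be seen in a plain forward algorithm. The sketch below is a textbook scaled forward pass in pure Python, not the zipHMM implementation and without its substring-reuse optimisation. The function name and argument layout are assumptions chosen for illustration.

```python
import math


def forward_log_likelihood(init, trans, emit, obs):
    """Return log P(obs | model) using the scaled forward algorithm.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    obs         : non-empty sequence of integer symbols
    """
    n_states = len(init)
    # Forward variables for the first observation.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # One step costs O(n_states^2): every state sums over all
            # predecessor states. Repeated once per observation, this
            # gives the linear-in-length, quadratic-in-states total.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        # Rescale to avoid underflow on long sequences and
        # accumulate the log of the scaling factors.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik
```

With `init = [0.5, 0.5]`, `trans = [[0.9, 0.1], [0.2, 0.8]]`, `emit = [[0.7, 0.3], [0.1, 0.9]]` and `obs = [0, 1]`, the unscaled forward variables sum to 0.165, so the function returns `math.log(0.165)` up to floating-point error.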
Affiliation(s)
- Andreas Sand
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
6
Gompert Z, Buerkle CA. Analyses of genetic ancestry enable key insights for molecular ecology. Mol Ecol 2013; 22:5278-94. [DOI: 10.1111/mec.12488] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 05/11/2013] [Revised: 08/05/2013] [Accepted: 08/08/2013] [Indexed: 12/15/2022]
Affiliation(s)
- C. Alex Buerkle
  - Department of Botany, University of Wyoming, Laramie, WY 82071, USA
7
Schoville SD, Bonin A, François O, Lobreaux S, Melodelima C, Manel S. Adaptive Genetic Variation on the Landscape: Methods and Cases. Annual Review of Ecology, Evolution, and Systematics 2012. [DOI: 10.1146/annurev-ecolsys-110411-160248] [Citation(s) in RCA: 217] [Impact Index Per Article: 18.1] [Indexed: 11/09/2022]
Affiliation(s)
- Sean D. Schoville
  - Laboratoire TIMC-IMAG, UMR-CNRS 5525, Université Joseph Fourier, 38041 Grenoble, France
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Université Joseph Fourier, 38041 Grenoble, France
- Aurélie Bonin
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Université Joseph Fourier, 38041 Grenoble, France
- Olivier François
  - Laboratoire TIMC-IMAG, UMR-CNRS 5525, Université Joseph Fourier, 38041 Grenoble, France
- Stéphane Lobreaux
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Université Joseph Fourier, 38041 Grenoble, France
- Christelle Melodelima
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Université Joseph Fourier, 38041 Grenoble, France
- Stéphanie Manel
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Université Joseph Fourier, 38041 Grenoble, France
  - Laboratoire Population Environnement et Développement, UMR-IRD 151, Université Aix-Marseille, 13331 Marseille, France
8
Gompert Z, Parchman TL, Buerkle CA. Genomics of isolation in hybrids. Philos Trans R Soc Lond B Biol Sci 2012; 367:439-50. [PMID: 22201173 DOI: 10.1098/rstb.2011.0196] [Citation(s) in RCA: 102] [Impact Index Per Article: 8.5] [Indexed: 11/12/2022] Open
Abstract
Hybrid zones are common in nature and can offer critical insights into the dynamics and components of reproductive isolation. Hybrids between diverged lineages are particularly informative about the genetic architecture of reproductive isolation, because introgression in an admixed population is a direct measure of isolation. In this paper, we combine simulations and a new statistical model to determine the extent to which different genetic architectures of isolation leave different signatures on genome-level patterns of introgression. We found that reproductive isolation caused by one or several loci of large effect caused greater heterogeneity in patterns of introgression than architectures involving many loci with small fitness effects, particularly when isolating factors were closely linked. The same conditions that led to heterogeneous introgression often resulted in a reasonable correspondence between outlier loci and the genetic loci that contributed to isolation. However, demographic conditions affected both of these results, highlighting potential limitations to the study of speciation genomics. Further progress in understanding the genomics of speciation will require large-scale empirical studies of introgression in hybrid zones and model-based analyses, as well as more comprehensive modelling of the expected levels of isolation with different demographies and genetic architectures of isolation.
Affiliation(s)
- Zachariah Gompert
  - Department of Botany and Program in Ecology, University of Wyoming, Laramie, WY 82071, USA
9
Hofer T, Foll M, Excoffier L. Evolutionary forces shaping genomic islands of population differentiation in humans. BMC Genomics 2012; 13:107. [PMID: 22439654 PMCID: PMC3317871 DOI: 10.1186/1471-2164-13-107] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 10/04/2011] [Accepted: 03/22/2012] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Levels of differentiation among populations depend both on demographic and selective factors: genetic drift and local adaptation increase population differentiation, which is eroded by gene flow and balancing selection. We describe here the genomic distribution and the properties of genomic regions with unusually high and low levels of population differentiation in humans to assess the influence of selective and neutral processes on human genetic structure. METHODS Individual SNPs of the Human Genome Diversity Panel (HGDP) showing significantly high or low levels of population differentiation were detected under a hierarchical-island model (HIM). A Hidden Markov Model allowed us to detect genomic regions or islands of high or low population differentiation. RESULTS Under the HIM, only 1.5% of all SNPs are significant at the 1% level, but their genomic spatial distribution is significantly non-random. We find evidence that local adaptation shaped high-differentiation islands, as they are enriched for non-synonymous SNPs and overlap with previously identified candidate regions for positive selection. Moreover there is a negative relationship between the size of islands and recombination rate, which is stronger for islands overlapping with genes. Gene ontology analysis supports the role of diet as a major selective pressure in those highly differentiated islands. Low-differentiation islands are also enriched for non-synonymous SNPs, and contain an overly high proportion of genes belonging to the 'Oncogenesis' biological process. 
CONCLUSIONS Even though selection seems to be acting in shaping islands of high population differentiation, neutral demographic processes might have promoted the appearance of some genomic islands since i) as much as 20% of islands are in non-genic regions ii) these non-genic islands are on average two times shorter than genic islands, suggesting a more rapid erosion by recombination, and iii) most loci are strongly differentiated between Africans and non-Africans, a result consistent with known human demographic history.
Affiliation(s)
- Tamara Hofer
  - Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Matthieu Foll
  - Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Laurent Excoffier
  - Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
10
Haudry A, Zha H, Stift M, Mable BK. Disentangling the effects of breakdown of self-incompatibility and transition to selfing in North American Arabidopsis lyrata. Mol Ecol 2012; 21:1130-42. [DOI: 10.1111/j.1365-294x.2011.05435.x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
11
Abstract
Vast tracts of noncoding DNA contain elements that regulate gene expression in higher eukaryotes. Describing these regulatory elements and understanding how they evolve represent major challenges for biologists. Advances in the ability to survey genome-scale DNA sequence data are providing unprecedented opportunities to use evolutionary models and computational tools to identify functionally important elements and the mode of selection acting on them in multiple species. This chapter reviews some of the current methods that have been developed and applied on noncoding DNA, what they have shown us, and how they are limited. Results of several recent studies reveal that a significantly larger fraction of noncoding DNA in eukaryotic organisms is likely to be functional than previously believed, implying that the functional annotation of most noncoding DNA in these organisms is largely incomplete. In Drosophila, recent studies have further suggested that a large fraction of noncoding DNA divergence observed between species may be the product of recurrent adaptive substitution. Similar studies in humans have revealed a more complex pattern, with signatures of recurrent positive selection being largely concentrated in conserved noncoding DNA elements. Understanding these patterns and the extent to which they generalize to other organisms awaits the analysis of forthcoming genome-scale polymorphism and divergence data from more species.
Affiliation(s)
- Ying Zhen
  - Department of Ecology and Evolutionary Biology, The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
12
Wilson DJ, Hernandez RD, Andolfatto P, Przeworski M. A population genetics-phylogenetics approach to inferring natural selection in coding sequences. PLoS Genet 2011; 7:e1002395. [PMID: 22144911 PMCID: PMC3228810 DOI: 10.1371/journal.pgen.1002395] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Received: 08/20/2010] [Accepted: 10/08/2011] [Indexed: 01/23/2023] Open
Abstract
Through an analysis of polymorphism within and divergence between species, we can hope to learn about the distribution of selective effects of mutations in the genome, changes in the fitness landscape that occur over time, and the location of sites involved in key adaptations that distinguish modern-day species. We introduce a novel method for the analysis of variation in selection pressures within and between species, spatially along the genome and temporally between lineages. We model codon evolution explicitly using a joint population genetics-phylogenetics approach that we developed for the construction of multiallelic models with mutation, selection, and drift. Our approach has the advantage of performing direct inference on coding sequences, inferring ancestral states probabilistically, utilizing allele frequency information, and generalizing to multiple species. We use a Bayesian sliding window model for intragenic variation in selection coefficients that efficiently combines information across sites and captures spatial clustering within the genome. To demonstrate the utility of the method, we infer selective pressures acting in Drosophila melanogaster and D. simulans from polymorphism and divergence data for 100 X-linked coding regions.
Affiliation(s)
- Daniel J Wilson
  - Department of Human Genetics and Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, USA
13
Kolaczkowski B, Hupalo DN, Kern AD. Recurrent adaptation in RNA interference genes across the Drosophila phylogeny. Mol Biol Evol 2010; 28:1033-42. [PMID: 20971974 DOI: 10.1093/molbev/msq284] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.0] [Indexed: 12/16/2022] Open
Abstract
RNA interference (RNAi) is quickly emerging as a vital component of genome organization, gene regulation, and immunity in Drosophila and other species. Previous studies have suggested that, as a whole, genes involved in RNAi are under intense positive selection in Drosophila melanogaster. Here, we characterize the extent and patterns of adaptive evolution in 23 known Drosophila RNAi genes, both within D. melanogaster and across the Drosophila phylogeny. We find strong evidence for recurrent protein-coding adaptation at a large number of RNAi genes, particularly those involved in antiviral immunity and defense against transposable elements. We identify specific functional domains involved in direct protein-RNA interactions as particular hotspots of recurrent adaptation in multiple RNAi genes, suggesting that targeted coadaptive arms races may be a general feature of RNAi evolution. Our observations suggest a predictive model of how selective pressures generated by evolutionary arms race scenarios may affect multiple genes across protein interaction networks and other biochemical pathways.