1
|
Zhao H, Alachiotis N. Data preprocessing methods for selective sweep detection using convolutional neural networks. Methods 2025; 233:19-29. [PMID: 39550020 DOI: 10.1016/j.ymeth.2024.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2024] [Revised: 10/28/2024] [Accepted: 11/04/2024] [Indexed: 11/18/2024] Open
Abstract
The identification of positive selection has been framed as a classification task, with Convolutional Neural Networks (CNNs) already outperforming summary statistics and likelihood-based approaches in accuracy. Despite the prevalence of CNN-based methods that manipulate the pixels of images representing raw genomic data as a preprocessing step to improve classification accuracy, the efficacy of these pixel-rearrangement techniques remains inadequately examined, particularly in the presence of confounding factors like population bottlenecks, migration and recombination hotspots. We introduce a set of pixel rearrangement algorithms aimed at enhancing CNN classification accuracy in detecting selective sweeps. These algorithms are employed to assess the performance of four CNN models for selective sweep detection. Our findings illustrate that the judicious application of rearrangement algorithms notably enhances the overall classification accuracy of a CNN across various datasets simulating confounding factors. We observed that sorting the columns of the genomic matrices has higher on CNN performance than rearranging the sequences. To some extent, these rearrangement algorithms are more robust to misspecified demographic models compared with the utilization of the default preprocessing algorithm as suggested by the respective authors of each CNN architecture. We provide the data rearrangement algorithms as a distinct package available for download at: https://github.com/Zhaohq96/Genetic-data-rearrangement.
Collapse
Affiliation(s)
- Hanqing Zhao
- University of Twente, Drienerlolaan 5, Enschede, 7522 NB, Overijssel, the Netherlands.
| | - Nikolaos Alachiotis
- University of Twente, Drienerlolaan 5, Enschede, 7522 NB, Overijssel, the Netherlands.
| |
Collapse
|
2
|
Tripathi D, Bhattacharyya C, Basu A. Deep learning insights into distinct patterns of polygenic adaptation across human populations. Nucleic Acids Res 2024; 52:e102. [PMID: 39558170 DOI: 10.1093/nar/gkae1027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 10/10/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024] Open
Abstract
Response to spatiotemporal variation in selection gradients resulted in signatures of polygenic adaptation in human genomes. We introduce RAISING, a two-stage deep learning framework that optimizes neural network architecture through hyperparameter tuning before performing feature selection and prediction tasks. We tested RAISING on published and newly designed simulations that incorporate the complex interplay between demographic history and selection gradients. RAISING outperformed Phylogenetic Generalized Least Squares (PGLS), ridge regression and DeepGenomeScan, with significantly higher true positive rates (TPR) in detecting genetic adaptation. It reduced computational time by 60-fold and increased TPR by up to 28% compared to DeepGenomeScan on published data. In more complex demographic simulations, RAISING showed lower false discoveries and significantly higher TPR, up to 17-fold, compared to other methods. RAISING demonstrated robustness with least sensitivity to demographic history, selection gradient and their interactions. We developed a sliding window method for genome-wide implementation of RAISING to overcome the computational challenges of high-dimensional genomic data. Applied to African, European, South Asian and East Asian populations, we identified multiple genomic regions undergoing polygenic selection. Notably, ∼70% of the regions identified in Africans are unique, with broad patterns distinguishing them from non-Africans, corroborating the Out of Africa dispersal model.
Collapse
Affiliation(s)
- Devashish Tripathi
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
- Regional Centre for Biotechnology, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurugram Expressway, Faridabad 121001, Haryana (Delhi NCR), India
| | - Chandrika Bhattacharyya
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
| | - Analabha Basu
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
- Regional Centre for Biotechnology, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurugram Expressway, Faridabad 121001, Haryana (Delhi NCR), India
| |
Collapse
|
3
|
Amin MR, Hasan M, DeGiorgio M. Digital Image Processing to Detect Adaptive Evolution. Mol Biol Evol 2024; 41:msae242. [PMID: 39565932 PMCID: PMC11631197 DOI: 10.1093/molbev/msae242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 10/28/2024] [Accepted: 11/13/2024] [Indexed: 11/22/2024] Open
Abstract
In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease at which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yield high true positive rate and accuracy to detect hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling those of contemporary deep learning approaches for detecting sweeps.
Collapse
Affiliation(s)
- Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Mahmudul Hasan
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
4
|
Temple SD, Waples RK, Browning SR. Modeling recent positive selection using identity-by-descent segments. Am J Hum Genet 2024; 111:2510-2529. [PMID: 39362217 PMCID: PMC11568764 DOI: 10.1016/j.ajhg.2024.08.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 10/05/2024] Open
Abstract
Recent positive selection can result in an excess of long identity-by-descent (IBD) haplotype segments overlapping a locus. The statistical methods that we propose here address three major objectives in studying selective sweeps: scanning for regions of interest, identifying possible sweeping alleles, and estimating a selection coefficient s. First, we implement a selection scan to locate regions with excess IBD rates. Second, we estimate the allele frequency and location of an unknown sweeping allele by aggregating over variants that are more abundant in an inferred outgroup with excess IBD rate versus the rest of the sample. Third, we propose an estimator for the selection coefficient and quantify uncertainty using the parametric bootstrap. Comparing against state-of-the-art methods in extensive simulations, we show that our methods are more precise at estimating s when s≥0.015. We also show that our 95% confidence intervals contain s in nearly 95% of our simulations. We apply these methods to study positive selection in European ancestry samples from the Trans-Omics for Precision Medicine project. We analyze eight loci where IBD rates are more than four standard deviations above the genome-wide median, including LCT where the maximum IBD rate is 35 standard deviations above the genome-wide median. Overall, we present robust and accurate approaches to study recent adaptive evolution without knowing the identity of the causal allele or using time series data.
Collapse
Affiliation(s)
- Seth D Temple
- Department of Statistics, University of Washington, Seattle, WA, USA.
| | - Ryan K Waples
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| | - Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
| |
Collapse
|
5
|
Gompert Z, DeRaad DA, Buerkle CA. A Next Generation of Hierarchical Bayesian Analyses of Hybrid Zones Enables Model-Based Quantification of Variation in Introgression in R. Ecol Evol 2024; 14:e70548. [PMID: 39583044 PMCID: PMC11582016 DOI: 10.1002/ece3.70548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Revised: 10/18/2024] [Accepted: 10/28/2024] [Indexed: 11/26/2024] Open
Abstract
Hybrid zones, where genetically distinct groups of organisms meet and interbreed, offer valuable insights into the nature of species and speciation. Here, we present a new R package, bgchm, for population genomic analyses of hybrid zones. This R package extends and updates the existing bgc software and combines Bayesian analyses of hierarchical genomic clines with Bayesian methods for estimating hybrid indexes, interpopulation ancestry proportions, and geographic clines. Compared to existing software, bgchm offers enhanced efficiency through Hamiltonian Monte Carlo sampling and the ability to work with genotype likelihoods combined with a hierarchical Bayesian approach, enabling inference for diverse types of genetic data sets. The package also facilitates the quantification of introgression patterns across genomes, which is crucial for understanding reproductive isolation and speciation genetics. We first describe the models underlying bgchm and then provide an overview of the R package and illustrate its use through the analysis of simulated and empirical data sets. We show that bgchm generates accurate estimates of model parameters under a variety of conditions, especially when the genetic loci analyzed are highly ancestry informative. This includes relatively robust estimates of genome-wide variability in clines, which has not been the focus of previous models and methods. We also illustrate how both selection and genetic drift contribute to variability in introgression among loci and how additional information can be used to help distinguish these contributions. We conclude by describing the promises and limitations of bgchm, comparing bgchm to other software for genomic cline analyses, and identifying areas for fruitful future development.
Collapse
Affiliation(s)
| | - Devon A. DeRaad
- Department of Ecology & Evolutionary BiologyUniversity of KansasLawrenceKansasUSA
| | | |
Collapse
|
6
|
Zhao S, Chi L, Fu M, Chen H. HaploSweep: Detecting and Distinguishing Recent Soft and Hard Selective Sweeps through Haplotype Structure. Mol Biol Evol 2024; 41:msae192. [PMID: 39288167 PMCID: PMC11452351 DOI: 10.1093/molbev/msae192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2024] [Revised: 07/29/2024] [Accepted: 09/03/2024] [Indexed: 09/19/2024] Open
Abstract
Identifying soft selective sweeps using genomic data is a challenging yet crucial task in population genetics. In this study, we present HaploSweep, a novel method for detecting and categorizing soft and hard selective sweeps based on haplotype structure. Through simulations spanning a broad range of selection intensities, softness levels, and demographic histories, we demonstrate that HaploSweep outperforms iHS, nSL, and H12 in detecting soft sweeps. HaploSweep achieves high classification accuracy-0.9247 for CHB, 0.9484 for CEU, and 0.9829 YRI-when applied to simulations in line with the human Out-of-Africa demographic model. We also observe that the classification accuracy remains consistently robust across different demographic models. Additionally, we introduce a refined method to accurately distinguish soft shoulders adjacent to hard sweeps from soft sweeps. Application of HaploSweep to genomic data of CHB, CEU, and YRI populations from the 1000 genomes project has led to the discovery of several new genes that bear strong evidence of population-specific soft sweeps (HRNR, AMBRA1, CBFA2T2, DYNC2H1, and RANBP2 etc.), with prevalent associations to immune functions and metabolic processes. The validated performance of HaploSweep, demonstrated through both simulated and real data, underscores its potential as a valuable tool for detecting and comprehending the role of soft sweeps in adaptive evolution.
Collapse
Affiliation(s)
- Shilei Zhao
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Lianjiang Chi
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Mincong Fu
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hua Chen
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
- CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| |
Collapse
|
7
|
van den Belt S, Zhao H, Alachiotis N. Scalable CNN-based classification of selective sweeps using derived allele frequencies. Bioinformatics 2024; 40:ii29-ii36. [PMID: 39230693 PMCID: PMC11373383 DOI: 10.1093/bioinformatics/btae385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION Selective sweeps can successfully be distinguished from neutral genetic data using summary statistics and likelihood-based methods that analyze single nucleotide polymorphisms (SNPs). However, these methods are sensitive to confounding factors, such as severe population bottlenecks and old migration. By virtue of machine learning, and specifically convolutional neural networks (CNNs), new accurate classification models that are robust to confounding factors have been recently proposed. However, such methods are more computationally expensive than summary-statistic-based ones, yielding them impractical for processing large-scale genomic data. Moreover, SNP data are frequently preprocessed to improve classification accuracy, further exacerbating the long analysis times. RESULTS To this end, we propose a 1D CNN-based model, dubbed FAST-NN, that does not require any preprocessing while using only derived allele frequencies instead of summary statistics or raw SNP data, thereby yielding a sample-size-invariant, scalable solution. We evaluated several data fusion approaches to account for the variance of the density of genetic diversity across genomic regions (a selective sweep signature), and performed an extensive neural architecture search based on a state-of-the-art reference network architecture (SweepNet). The resulting model, FAST-NN, outperforms the reference architecture by up to 12% inference accuracy over all challenging evolutionary scenarios with confounding factors that were evaluated. Moreover, FAST-NN is between 30× and 259× faster on a single CPU core, and between 2.0× and 6.2× faster on a GPU, when processing sample sizes between 128 and 1000 samples. Our work paves the way for the practical use of CNNs in large-scale selective sweep detection. AVAILABILITY AND IMPLEMENTATION https://github.com/SjoerdvandenBelt/FAST-NN.
Collapse
Affiliation(s)
- Sjoerd van den Belt
- Department of Computer Science, Faculty of EEMCS, University of Twente, 7522NB Enschede, The Netherlands
| | - Hanqing Zhao
- Department of Computer Science, Faculty of EEMCS, University of Twente, 7522NB Enschede, The Netherlands
| | - Nikolaos Alachiotis
- Department of Computer Science, Faculty of EEMCS, University of Twente, 7522NB Enschede, The Netherlands
| |
Collapse
|
8
|
Fonseca EM, Carstens BC. Artificial intelligence enables unified analysis of historical and landscape influences on genetic diversity. Mol Phylogenet Evol 2024; 198:108116. [PMID: 38871263 DOI: 10.1016/j.ympev.2024.108116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2023] [Revised: 04/04/2024] [Accepted: 06/04/2024] [Indexed: 06/15/2024]
Abstract
While genetic variation in any species is potentially shaped by a range of processes, phylogeography and landscape genetics are largely concerned with inferring how environmental conditions and landscape features impact neutral intraspecific diversity. However, even as both disciplines have come to utilize SNP data over the last decades, analytical approaches have remained for the most part focused on either broad-scale inferences of historical processes (phylogeography) or on more localized inferences about environmental and/or landscape features (landscape genetics). Here we demonstrate that an artificial intelligence model-based analytical framework can consider both deeper historical factors and landscape-level processes in an integrated analysis. We implement this framework using data collected from two Brazilian anurans, the Brazilian sibilator frog (Leptodactylus troglodytes) and granular toad (Rhinella granulosa). Our results indicate that historical demographic processes shape most the genetic variation in the sibulator frog, while landscape processes primarily influence variation in the granular toad. The machine learning framework used here allows both historical and landscape processes to be considered equally, rather than requiring researchers to make an a priori decision about which factors are important.
Collapse
Affiliation(s)
- Emanuel M Fonseca
- Museum of Biological Diversity & Department of Evolution, Ecology and Organismal Biology, The Ohio State University, 1315 Kinnear Rd., Columbus OH 43212, USA
| | - Bryan C Carstens
- Museum of Biological Diversity & Department of Evolution, Ecology and Organismal Biology, The Ohio State University, 1315 Kinnear Rd., Columbus OH 43212, USA.
| |
Collapse
|
9
|
Vaughn AH, Nielsen R. Fast and Accurate Estimation of Selection Coefficients and Allele Histories from Ancient and Modern DNA. Mol Biol Evol 2024; 41:msae156. [PMID: 39078618 PMCID: PMC11321360 DOI: 10.1093/molbev/msae156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2023] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/31/2024] Open
Abstract
We here present CLUES2, a full-likelihood method to infer natural selection from sequence data that is an extension of the method CLUES. We make several substantial improvements to the CLUES method that greatly increases both its applicability and its speed. We add the ability to use ancestral recombination graphs on ancient data as emissions to the underlying hidden Markov model, which enables CLUES2 to use both temporal and linkage information to make estimates of selection coefficients. We also fully implement the ability to estimate distinct selection coefficients in different epochs, which allows for the analysis of changes in selective pressures through time, as well as selection with dominance. In addition, we greatly increase the computational efficiency of CLUES2 over CLUES using several approximations to the forward-backward algorithms and develop a new way to reconstruct historic allele frequencies by integrating over the uncertainty in the estimation of the selection coefficients. We illustrate the accuracy of CLUES2 through extensive simulations and validate the importance sampling framework for integrating over the uncertainty in the inference of gene trees. We also show that CLUES2 is well-calibrated by showing that under the null hypothesis, the distribution of log-likelihood ratios follows a χ2 distribution with the appropriate degrees of freedom. We run CLUES2 on a set of recently published ancient human data from Western Eurasia and test for evidence of changing selection coefficients through time. We find significant evidence of changing selective pressures in several genes correlated with the introduction of agriculture to Europe and the ensuing dietary and demographic shifts of that time. In particular, our analysis supports previous hypotheses of strong selection on lactase persistence during periods of ancient famines and attenuated selection in more modern periods.
Collapse
Affiliation(s)
- Andrew H Vaughn
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
| | - Rasmus Nielsen
- Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720, USA
- Center for GeoGenetics, University of Copenhagen, Copenhagen DK-1350, Denmark
| |
Collapse
|
10
|
Fonseca EM, Pope NS, Peterman WE, Werneck FP, Colli GR, Carstens BC. Genetic structure and landscape effects on gene flow in the Neotropical lizard Norops brasiliensis (Squamata: Dactyloidae). Heredity (Edinb) 2024; 132:284-295. [PMID: 38575800 PMCID: PMC11166928 DOI: 10.1038/s41437-024-00682-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 03/12/2024] [Accepted: 03/18/2024] [Indexed: 04/06/2024] Open
Abstract
One key research goal of evolutionary biology is to understand the origin and maintenance of genetic variation. In the Cerrado, the South American savanna located primarily in the Central Brazilian Plateau, many hypotheses have been proposed to explain how landscape features (e.g., geographic distance, river barriers, topographic compartmentalization, and historical climatic fluctuations) have promoted genetic structure by mediating gene flow. Here, we asked whether these landscape features have influenced the genetic structure and differentiation in the lizard species Norops brasiliensis (Squamata: Dactyloidae). To achieve our goal, we used a genetic clustering analysis and estimate an effective migration surface to assess genetic structure in the focal species. Optimized isolation-by-resistance models and a simulation-based approach combined with machine learning (convolutional neural network; CNN) were then used to infer current and historical effects on population genetic structure through 12 unique landscape models. We recovered five geographically distributed populations that are separated by regions of lower-than-expected gene flow. The results of the CNN showed that geographic distance is the sole predictor of genetic variation in N. brasiliensis, and that slope, rivers, and historical climate had no discernible influence on gene flow. Our novel CNN approach was accurate (89.5%) in differentiating each landscape model. CNN and other machine learning approaches are still largely unexplored in landscape genetics studies, representing promising avenues for future research with increasingly accessible genomic datasets.
Collapse
Affiliation(s)
- Emanuel M Fonseca
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| | - Nathaniel S Pope
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA
| | - William E Peterman
- School of Environment and Natural Resources, The Ohio State University, Columbus, OH, USA
| | - Fernanda P Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisas da Amazônia (INPA), Manaus, Brazil
| | - Guarino R Colli
- Departamento de Zoologia, Universidade de Brasília, Brasília, Brazil
| | - Bryan C Carstens
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA.
| |
Collapse
|
11
|
Song H, Chu J, Li W, Li X, Fang L, Han J, Zhao S, Ma Y. A Novel Approach Utilizing Domain Adversarial Neural Networks for the Detection and Classification of Selective Sweeps. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2304842. [PMID: 38308186 PMCID: PMC11005742 DOI: 10.1002/advs.202304842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Revised: 01/10/2024] [Indexed: 02/04/2024]
Abstract
The identification and classification of selective sweeps are of great significance for improving the understanding of biological evolution and exploring opportunities for precision medicine and genetic improvement. Here, a domain adaptation sweep detection and classification (DASDC) method is presented to balance the alignment of two domains and the classification performance through a domain-adversarial neural network and its adversarial learning modules. DASDC effectively addresses the issue of mismatch between training data and real genomic data in deep learning models, leading to a significant improvement in its generalization capability, prediction robustness, and accuracy. The DASDC method demonstrates improved identification performance compared to existing methods and excels in classification performance, particularly in scenarios where there is a mismatch between application data and training data. The successful implementation of DASDC in real data of three distinct species highlights its potential as a useful tool for identifying crucial functional genes and investigating adaptive evolutionary mechanisms, particularly with the increasing availability of genomic data.
Collapse
Affiliation(s)
- Hui Song
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
| | - Jinyu Chu
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
| | - Wangjiao Li
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
| | - Xinyun Li
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
- Hubei Hongshan LaboratoryWuhan430070China
| | - Lingzhao Fang
- Center for Quantitative Genetics and GenomicsAarhus UniversityAarhus8000Denmark
| | - Jianlin Han
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
- CAAS‐ILRI Joint Laboratory on Livestock and Forage Genetic ResourcesInstitute of Animal ScienceChinese Academy of Agricultural Sciences (CAAS)Beijing100193China
- Livestock Genetics ProgramInternational Livestock Research Institute (ILRI)Nairobi00100Kenya
| | - Shuhong Zhao
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
- Hubei Hongshan LaboratoryWuhan430070China
- Lingnan Modern Agricultural Science and Technology Guangdong LaboratoryGuangzhou510642China
| | - Yunlong Ma
- Key Laboratory of Agricultural Animal GeneticsBreeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of AgricultureHuazhong Agricultural UniversityWuhan430070China
- Hubei Hongshan LaboratoryWuhan430070China
- Lingnan Modern Agricultural Science and Technology Guangdong LaboratoryGuangzhou510642China
| |
Collapse
|
12
|
Huang X, Rymbekova A, Dolgova O, Lao O, Kuhlwilm M. Harnessing deep learning for population genetic inference. Nat Rev Genet 2024; 25:61-78. [PMID: 37666948 DOI: 10.1038/s41576-023-00636-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2023] [Indexed: 09/06/2023]
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Collapse
Affiliation(s)
- Xin Huang
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| | - Aigerim Rymbekova
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
| | - Olga Dolgova
- Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
| | - Oscar Lao
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain.
| | - Martin Kuhlwilm
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| |
Collapse
|
13
|
Cecil RM, Sugden LA. On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns. PLoS Comput Biol 2023; 19:e1010979. [PMID: 38011281 PMCID: PMC10703409 DOI: 10.1371/journal.pcbi.1010979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2023] [Revised: 12/07/2023] [Accepted: 10/26/2023] [Indexed: 11/29/2023] Open
Abstract
A central challenge in population genetics is the detection of genomic footprints of selection. As machine learning tools including convolutional neural networks (CNNs) have become more sophisticated and applied more broadly, these provide a logical next step for increasing our power to learn and detect such patterns; indeed, CNNs trained on simulated genome sequences have recently been shown to be highly effective at this task. Unlike previous approaches, which rely upon human-crafted summary statistics, these methods are able to be applied directly to raw genomic data, allowing them to potentially learn new signatures that, if well-understood, could improve the current theory surrounding selective sweeps. Towards this end, we examine a representative CNN from the literature, paring it down to the minimal complexity needed to maintain comparable performance; this low-complexity CNN allows us to directly interpret the learned evolutionary signatures. We then validate these patterns in more complex models using metrics that evaluate feature importance. Our findings reveal that preprocessing steps, which determine how the population genetic data is presented to the model, play a central role in the learned prediction method. This results in models that mimic previously-defined summary statistics; in one case, the summary statistic itself achieves similarly high accuracy. For evolutionary processes that are less well understood than selective sweeps, we hope this provides an initial framework for using CNNs in ways that go beyond simply achieving high classification performance. Instead, we propose that CNNs might be useful as tools for learning novel patterns that can translate to easy-to-implement summary statistics available to a wider community of researchers.
Collapse
Affiliation(s)
- Ryan M. Cecil
- Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pennsylvania, United States of America
- Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Lauren A. Sugden
- Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pennsylvania, United States of America
| |
Collapse
|
14
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. PLoS Genet 2023; 19:e1011032. [PMID: 37934781 PMCID: PMC10655966 DOI: 10.1371/journal.pgen.1011032] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2023] [Revised: 11/17/2023] [Accepted: 10/23/2023] [Indexed: 11/09/2023] Open
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods-SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Collapse
Affiliation(s)
- Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| |
Collapse
|
15
|
Kloska A, Giełczyk A, Grzybowski T, Płoski R, Kloska SM, Marciniak T, Pałczyński K, Rogalla-Ładniak U, Malyarchuk BA, Derenko MV, Kovačević-Grujičić N, Stevanović M, Drakulić D, Davidović S, Spólnicka M, Zubańska M, Woźniak M. A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe. Int J Mol Sci 2023; 24:15095. [PMID: 37894775 PMCID: PMC10606184 DOI: 10.3390/ijms242015095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Revised: 10/03/2023] [Accepted: 10/07/2023] [Indexed: 10/29/2023] Open
Abstract
Data obtained with the use of massive parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data harbor the potential for distinguishing samples from different populations, especially from those coming from adjacent populations of common origin. Machine learning (ML) techniques seem to be especially well suited for analyzing large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective of this study was to empirically evaluate the feasibility of discerning the genetic provenance of individuals of Slavic descent who exhibit genetic similarity, with the overarching goal of categorizing DNA specimens derived from diverse Slavic population representatives. Raw sequencing data were pre-processed, to obtain a 1200 character-long binary vector. A total of three classifiers were used-Random Forest, Support Vector Machine (SVM), and XGBoost. The most-promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846-1.000 for all classes.
Collapse
Affiliation(s)
- Anna Kloska
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Agata Giełczyk
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Tomasz Grzybowski
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| | - Rafał Płoski
- Department of Medical Genetics, Warsaw Medical University, 02106 Warsaw, Poland
| | - Sylwester M. Kloska
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Tomasz Marciniak
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Krzysztof Pałczyński
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Urszula Rogalla-Ładniak
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| | - Boris A. Malyarchuk
- Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
| | - Miroslava V. Derenko
- Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
| | - Nataša Kovačević-Grujičić
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
| | - Milena Stevanović
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
- Faculty of Biology, University of Belgrade, 11000 Belgrade, Serbia
- Serbian Academy of Sciences and Arts, 11000 Belgrade, Serbia
| | - Danijela Drakulić
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
| | - Slobodan Davidović
- Institute for Biological Research “Siniša Stanković”, National Institute of Republic of Serbia, University of Belgrade, 11060 Belgrade, Serbia
| | | | - Magdalena Zubańska
- Faculty of Law and Administration, Department of Criminology and Forensic Sciences, University of Warmia and Mazury, 10726 Olsztyn, Poland
| | - Marcin Woźniak
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| |
Collapse
|
16
|
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor Decomposition-based Feature Extraction and Classification to Detect Natural Selection from Genomic Data. Mol Biol Evol 2023; 40:msad216. [PMID: 37772983 PMCID: PMC10581699 DOI: 10.1093/molbev/msad216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2023] [Revised: 08/10/2023] [Accepted: 09/14/2023] [Indexed: 09/30/2023] Open
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data although preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
Collapse
Affiliation(s)
- Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Mahmudul Hasan
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Sandipan Paul Arnab
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
17
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.01.529396. [PMID: 36909514 PMCID: PMC10002701 DOI: 10.1101/2023.03.01.529396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/06/2023]
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods-SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Collapse
Affiliation(s)
- Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| |
Collapse
|
18
|
Arnab SP, Amin MR, DeGiorgio M. Uncovering Footprints of Natural Selection Through Spectral Analysis of Genomic Summary Statistics. Mol Biol Evol 2023; 40:msad157. [PMID: 37433019 PMCID: PMC10365025 DOI: 10.1093/molbev/msad157] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2022] [Revised: 06/28/2023] [Accepted: 07/06/2023] [Indexed: 07/13/2023] Open
Abstract
Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data.
Collapse
Affiliation(s)
- Sandipan Paul Arnab
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
19
|
Zhao H, Souilljee M, Pavlidis P, Alachiotis N. Genome-wide scans for selective sweeps using convolutional neural networks. Bioinformatics 2023; 39:i194-i203. [PMID: 37387128 DOI: 10.1093/bioinformatics/btad265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Recent methods for selective sweep detection cast the problem as a classification task and use summary statistics as features to capture region characteristics that are indicative of a selective sweep, thereby being sensitive to confounding factors. Furthermore, they are not designed to perform whole-genome scans or to estimate the extent of the genomic region that was affected by positive selection; both are required for identifying candidate genes and the time and strength of selection. RESULTS We present ASDEC (https://github.com/pephco/ASDEC), a neural-network-based framework that can scan whole genomes for selective sweeps. ASDEC achieves similar classification performance to other convolutional neural network-based classifiers that rely on summary statistics, but it is trained 10× faster and classifies genomic regions 5× faster by inferring region characteristics from the raw sequence data directly. Deploying ASDEC for genomic scans achieved up to 15.2× higher sensitivity, 19.4× higher success rates, and 4× higher detection accuracy than state-of-the-art methods. We used ASDEC to scan human chromosome 1 of the Yoruba population (1000Genomes project), identifying nine known candidate genes.
Collapse
Affiliation(s)
- Hanqing Zhao
- Faculty of EEMCS, University of Twente, Enschede, The Netherlands
| | | | - Pavlos Pavlidis
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | | |
Collapse
|
20
|
Hamid I, Korunes KL, Schrider DR, Goldberg A. Localizing Post-Admixture Adaptive Variants with Object Detection on Ancestry-Painted Chromosomes. Mol Biol Evol 2023; 40:msad074. [PMID: 36947126 PMCID: PMC10116606 DOI: 10.1093/molbev/msad074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2022] [Revised: 03/14/2023] [Accepted: 03/20/2023] [Indexed: 03/23/2023] Open
Abstract
Gene flow between previously differentiated populations during the founding of an admixed or hybrid population has the potential to introduce adaptive alleles into the new population. If the adaptive allele is common in one source population, but not the other, then as the adaptive allele rises in frequency in the admixed population, genetic ancestry from the source containing the adaptive allele will increase nearby as well. Patterns of genetic ancestry have therefore been used to identify post-admixture positive selection in humans and other animals, including examples in immunity, metabolism, and animal coloration. A common method identifies regions of the genome that have local ancestry "outliers" compared with the distribution across the rest of the genome, considering each locus independently. However, we lack theoretical models for expected distributions of ancestry under various demographic scenarios, resulting in potential false positives and false negatives. Further, ancestry patterns between distant sites are often not independent. As a result, current methods tend to infer wide genomic regions containing many genes as under selection, limiting biological interpretation. Instead, we develop a deep learning object detection method applied to images generated from local ancestry-painted genomes. This approach preserves information from the surrounding genomic context and avoids potential pitfalls of user-defined summary statistics. We find the method is robust to a variety of demographic misspecifications using simulated data. Applied to human genotype data from Cabo Verde, we localize a known adaptive locus to a single narrow region compared with multiple or long windows obtained using two other ancestry-based methods.
Collapse
Affiliation(s)
- Iman Hamid
- Department of Evolutionary Anthropology, Duke University, Durham, NC
| | | | - Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
| | - Amy Goldberg
- Department of Evolutionary Anthropology, Duke University, Durham, NC
| |
Collapse
|
21
|
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor decomposition based feature extraction and classification to detect natural selection from genomic data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.27.527731. [PMID: 37034767 PMCID: PMC10081272 DOI: 10.1101/2023.03.27.527731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under non-convex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx , which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
Collapse
|
22
|
Kirsch-Gerweck B, Bohnenkämper L, Henrichs MT, Alanko JN, Bannai H, Cazaux B, Peterlongo P, Burger J, Stoye J, Diekmann Y. HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets. Mol Biol Evol 2023; 40:msad027. [PMID: 36790822 PMCID: PMC9985328 DOI: 10.1093/molbev/msad027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Revised: 02/01/2023] [Accepted: 02/06/2023] [Indexed: 02/16/2023] Open
Abstract
Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.
Collapse
Affiliation(s)
- Benedikt Kirsch-Gerweck
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
| | - Leonard Bohnenkämper
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Michel T Henrichs
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Jarno N Alanko
- Department of Computer Science, University of Helsinki, P.O 68, Pietari Kalmin katu 5, 00014 Helsinki, Finland
| | - Hideo Bannai
- M&D Data Science Center, Tokyo Medical and Dental University (TMDU), 2-3-10 Kanda-Surugadai, Chiyoda-ku, Tokyo 101-0062, Japan
| | - Bastien Cazaux
- CNRS, Centrale Lille, UMR 9189, Univ. Lille, CRIStAL, F-59000 Lille, France
| | - Pierre Peterlongo
- GenScale, Inria/Irisa Campus de Beaulieu, 35042 Rennes Cedex, France
| | - Joachim Burger
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Yoan Diekmann
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
- Research Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, United Kingdom
| |
Collapse
|
23
|
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023] Open
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Collapse
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
| | - Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
| | - Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
| |
Collapse
|
24
|
Sanchez T, Bray EM, Jobic P, Guez J, Letournel AC, Charpiat G, Cury J, Jay F. dnadna: a deep learning framework for population genetics inference. Bioinformatics 2022; 39:6851140. [PMID: 36445000 PMCID: PMC9825738 DOI: 10.1093/bioinformatics/btac765] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Revised: 10/30/2022] [Accepted: 11/28/2022] [Indexed: 11/30/2022] Open
Abstract
MOTIVATION We present dnadna, a flexible python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination and re-usability of neural networks designed for population genetic data. RESULTS dnadna defines multiple user-friendly workflows. First, users can implement new architectures and tasks, while benefiting from dnadna utility functions, training procedure and test environment, which saves time and decreases the likelihood of bugs. Second, the implemented networks can be re-optimized based on user-specified training sets and/or tasks. Newly implemented architectures and pre-trained networks are easily shareable with the community for further benchmarking or other applications. Finally, users can apply pre-trained networks in order to predict evolutionary history from alternative real or simulated genetic datasets, without requiring extensive knowledge in deep learning or coding in general. dnadna comes with a peer-reviewed, exchangeable neural network, allowing demographic inference from SNP data, that can be used directly or retrained to solve other tasks. Toy networks are also available to ease the exploration of the software, and we expect that the range of available architectures will keep expanding thanks to community contributions. AVAILABILITY AND IMPLEMENTATION dnadna is a Python (≥3.7) package, its repository is available at gitlab.com/mlgenetics/dnadna and its associated documentation at mlgenetics.gitlab.io/dnadna/.
Collapse
Affiliation(s)
| | | | - Pierre Jobic
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- ENS Paris-Saclay, 91190 Gif-sur-Yvette, France
| | - Jérémy Guez
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- UMR7206 Eco-Anthropologie, Muséum National d’Histoire Naturelle, CNRS, Université de Paris, 75016 Paris, France
| | - Anne-Catherine Letournel
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
| | - Guillaume Charpiat
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
| | - Jean Cury
- To whom correspondence should be addressed. or
| | - Flora Jay
- To whom correspondence should be addressed. or
| |
Collapse
|
25
|
Rahman M, Mandal A, Gayari I, Bidyalaxmi K, Sarkar D, Allu T, Debbarma A. Prospect and scope of artificial neural network in livestock farming: a review. BIOL RHYTHM RES 2022. [DOI: 10.1080/09291016.2022.2139389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Affiliation(s)
- Mokidur Rahman
- Animal Genetics and Breeding, Eastern Regional Station, ICAR-NDRI, Kalyani, India, 741235
| | - Ajoy Mandal
- Animal Genetics and Breeding, Eastern Regional Station, ICAR-NDRI, Kalyani, India, 741235
| | - Indrajit Gayari
- Animal Genetics and Breeding, Eastern Regional Station, ICAR-NDRI, Kalyani, India, 741235
| | - Kangabam Bidyalaxmi
- Animal Genetics and Breeding, Eastern Regional Station, ICAR-NDRI, Kalyani, India, 741235
| | - Debajyoti Sarkar
- Animal Reproduction Gynaecology and Obstetrics, Eastern Regional Station, ICAR- NDRI, Kalyani, India, 741235
| | - Teja Allu
- Animal Reproduction Gynaecology and Obstetrics, Southern Regional Station, ICAR-NDRI, Adugodi, India, 560030
| | - Asish Debbarma
- Livestock Production and Management, Eastern Regional Station, ICAR-NDRI, Kalyani, India, 741235
| |
Collapse
|
26
|
Nikolakis ZL, Adams RH, Wade KJ, Lund AJ, Carlton EJ, Castoe TA, Pollock DD. Prospects for genomic surveillance for selection in schistosome parasites. FRONTIERS IN EPIDEMIOLOGY 2022; 2:932021. [PMID: 38455290 PMCID: PMC10910990 DOI: 10.3389/fepid.2022.932021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/29/2022] [Accepted: 09/12/2022] [Indexed: 03/09/2024]
Abstract
Schistosomiasis is a neglected tropical disease caused by multiple parasitic Schistosoma species, and which impacts over 200 million people globally, mainly in low- and middle-income countries. Genomic surveillance to detect evidence for natural selection in schistosome populations represents an emerging and promising approach to identify and interpret schistosome responses to ongoing control efforts or other environmental factors. Here we review how genomic variation is used to detect selection, how these approaches have been applied to schistosomes, and how future studies to detect selection may be improved. We discuss the theory of genomic analyses to detect selection, identify experimental designs for such analyses, and review studies that have applied these approaches to schistosomes. We then consider the biological characteristics of schistosomes that are expected to respond to selection, particularly those that may be impacted by control programs. Examples include drug resistance, host specificity, and life history traits, and we review our current understanding of specific genes that underlie them in schistosomes. We also discuss how inherent features of schistosome reproduction and demography pose substantial challenges for effective identification of these traits and their genomic bases. We conclude by discussing how genomic surveillance for selection should be designed to improve understanding of schistosome biology, and how the parasite changes in response to selection.
Collapse
Affiliation(s)
- Zachary L. Nikolakis
- Department of Biology, University of Texas at Arlington, Arlington, TX, United States
| | - Richard H. Adams
- Department of Biological and Environmental Sciences, Georgia College and State University, Milledgeville, GA, United States
| | - Kristen J. Wade
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, United States
| | - Andrea J. Lund
- Department of Environmental and Occupational Health, Colorado School of Public Health, University of Colorado, Anschutz, Aurora, CO, United States
| | - Elizabeth J. Carlton
- Department of Environmental and Occupational Health, Colorado School of Public Health, University of Colorado, Anschutz, Aurora, CO, United States
| | - Todd A. Castoe
- Department of Biology, University of Texas at Arlington, Arlington, TX, United States
| | - David D. Pollock
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, United States
| |
Collapse
|
27
|
Qin X, Chiang CWK, Gaggiotti OE. Deciphering signatures of natural selection via deep learning. Brief Bioinform 2022; 23:6686736. [PMID: 36056746 PMCID: PMC9487700 DOI: 10.1093/bib/bbac354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2022] [Revised: 07/11/2022] [Accepted: 07/28/2022] [Indexed: 11/12/2022] Open
Abstract
Identifying genomic regions influenced by natural selection provides fundamental insights into the genetic basis of local adaptation. However, it remains challenging to detect loci under complex spatially varying selection. We propose a deep learning-based framework, DeepGenomeScan, which can detect signatures of spatially varying selection. We demonstrate that DeepGenomeScan outperformed principal component analysis- and redundancy analysis-based genome scans in identifying loci underlying quantitative traits subject to complex spatial patterns of selection. Noticeably, DeepGenomeScan increases statistical power by up to 47.25% under nonlinear environmental selection patterns. We applied DeepGenomeScan to a European human genetic dataset and identified some well-known genes under selection and a substantial number of clinically important genes that were not identified by SPA, iHS, Fst and Bayenv when applied to the same dataset.
Collapse
Affiliation(s)
- Xinghu Qin
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK
| | - Charleston W K Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine & Department of Quantitative and Computational Biology, University of Southern California, USA
| | - Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK
| |
Collapse
|
28
|
A Neural Network-Based Spectral Approach for the Assignment of Individual Trees to Genetically Differentiated Subpopulations. REMOTE SENSING 2022. [DOI: 10.3390/rs14122898] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Studying population structure has made an essential contribution to understanding evolutionary processes and demographic history in forest ecology research. This inference process basically involves the identification of common genetic variants among individuals, then grouping the similar individuals into subpopulations. In this study, a spectral-based classification of genetically differentiated groups was carried out using a provenance–progeny trial of Eucalyptus cladocalyx. First, the genetic structure was inferred through a Bayesian analysis using single-nucleotide polymorphisms (SNPs). Then, different machine learning models were trained with foliar spectral information to assign individual trees to subpopulations. The results revealed that spectral-based classification using the multilayer perceptron method was very successful at classifying individuals into their respective subpopulations (with an average of 87% of correct individual assignments), whereas 85% and 81% of individuals were assigned to their respective classes correctly by convolutional neural network and partial least squares discriminant analysis, respectively. Notably, 93% of individual trees were assigned correctly to the class with the smallest size using the spectral data-based multi-layer perceptron classification method. In conclusion, spectral data, along with neural network models, are able to discriminate and assign individuals to a given subpopulation, which could facilitate the implementation and application of population structure studies on a large scale.
Collapse
|
29
|
Borowiec ML, Dikow RB, Frandsen PB, McKeeken A, Valentini G, White AE. Deep learning as a tool for ecology and evolution. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13901] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Marek L. Borowiec
- Entomology, Plant Pathology and Nematology University of Idaho Moscow ID USA
- Institute for Bioinformatics and Evolutionary Studies (IBEST) University of Idaho Moscow ID USA
| | - Rebecca B. Dikow
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
| | - Paul B. Frandsen
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
- Department of Plant and Wildlife Sciences Brigham Young University Provo UT USA
| | - Alexander McKeeken
- Entomology, Plant Pathology and Nematology University of Idaho Moscow ID USA
| | | | - Alexander E. White
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
- Department of Botany, National Museum of Natural History Smithsonian Institution Washington DC USA
| |
Collapse
|
30
|
Kumar H, Panigrahi M, Panwar A, Rajawat D, Nayak SS, Saravanan KA, Kaisa K, Parida S, Bhushan B, Dutt T. Machine-Learning Prospects for Detecting Selection Signatures Using Population Genomics Data. J Comput Biol 2022; 29:943-960. [PMID: 35639362 DOI: 10.1089/cmb.2021.0447] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Natural selection has been given a lot of attention because it relates to the adaptation of populations to their environments, both biotic and abiotic. An allele is selected when it is favored by natural selection. Consequently, the favored allele increases in frequency in the population and neighboring linked variation diminishes, causing so-called selective sweeps. A high-throughput genomic sequence allows one to disentangle the evolutionary forces at play in populations. With the development of high-throughput genome sequencing technologies, it has become easier to detect these selective sweeps/selection signatures. Various methods can be used to detect selective sweeps, from simple implementations using summary statistics to complex statistical approaches. One of the important problems of these statistical models is the potential to provide inaccurate results when their assumptions are violated. The use of machine learning (ML) in population genetics has been introduced as an alternative method of detecting selection by treating the problem of detecting selection signatures as a classification problem. Since the availability of population genomics data is increasing, researchers may incorporate ML into these statistical models to infer signatures of selection with higher predictive accuracy and better resolution. This article describes how ML can be used to aid in detecting and studying natural selection patterns using population genomic data.
Collapse
Affiliation(s)
- Harshit Kumar
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Manjit Panigrahi
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Anuradha Panwar
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Divya Rajawat
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Sonali Sonejita Nayak
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - K A Saravanan
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Kaiho Kaisa
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Subhashree Parida
- Divisions of Pharmacology and Toxicology, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Bharat Bhushan
- Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| | - Triveni Dutt
- Livestock Production and Management Section, ICAR-Indian Veterinary Research Institute, Izatnagar, India
| |
Collapse
|
31
|
Avecilla G, Chuong JN, Li F, Sherlock G, Gresham D, Ram Y. Neural networks enable efficient and accurate simulation-based inference of evolutionary parameters from adaptation dynamics. PLoS Biol 2022; 20:e3001633. [PMID: 35622868 PMCID: PMC9140244 DOI: 10.1371/journal.pbio.3001633] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Accepted: 04/14/2022] [Indexed: 11/24/2022] Open
Abstract
The rate of adaptive evolution depends on the rate at which beneficial mutations are introduced into a population and the fitness effects of those mutations. The rate of beneficial mutations and their expected fitness effects is often difficult to empirically quantify. As these 2 parameters determine the pace of evolutionary change in a population, the dynamics of adaptive evolution may enable inference of their values. Copy number variants (CNVs) are a pervasive source of heritable variation that can facilitate rapid adaptive evolution. Previously, we developed a locus-specific fluorescent CNV reporter to quantify CNV dynamics in evolving populations maintained in nutrient-limiting conditions using chemostats. Here, we use CNV adaptation dynamics to estimate the rate at which beneficial CNVs are introduced through de novo mutation and their fitness effects using simulation-based likelihood-free inference approaches. We tested the suitability of 2 evolutionary models: a standard Wright-Fisher model and a chemostat model. We evaluated 2 likelihood-free inference algorithms: the well-established Approximate Bayesian Computation with Sequential Monte Carlo (ABC-SMC) algorithm, and the recently developed Neural Posterior Estimation (NPE) algorithm, which applies an artificial neural network to directly estimate the posterior distribution. By systematically evaluating the suitability of different inference methods and models, we show that NPE has several advantages over ABC-SMC and that a Wright-Fisher evolutionary model suffices in most cases. Using our validated inference framework, we estimate the CNV formation rate at the GAP1 locus in the yeast Saccharomyces cerevisiae to be 10-4.7 to 10-4 CNVs per cell division and a fitness coefficient of 0.04 to 0.1 per generation for GAP1 CNVs in glutamine-limited chemostats. We experimentally validated our inference-based estimates using 2 distinct experimental methods-barcode lineage tracking and pairwise fitness assays-which provide independent confirmation of the accuracy of our approach. Our results are consistent with a beneficial CNV supply rate that is 10-fold greater than the estimated rates of beneficial single-nucleotide mutations, explaining the outsized importance of CNVs in rapid adaptive evolution. More generally, our study demonstrates the utility of novel neural network-based likelihood-free inference methods for inferring the rates and effects of evolutionary processes from empirical data with possible applications ranging from tumor to viral evolution.
Collapse
Affiliation(s)
- Grace Avecilla
- Department of Biology, New York University, New York, New York, United States of America
- Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
| | - Julie N. Chuong
- Department of Biology, New York University, New York, New York, United States of America
- Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
| | - Fangfei Li
- Department of Genetics, Stanford University, California, Stanford, United States of America
| | - Gavin Sherlock
- Department of Genetics, Stanford University, California, Stanford, United States of America
| | - David Gresham
- Department of Biology, New York University, New York, New York, United States of America
- Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
| | - Yoav Ram
- School of Zoology, Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| |
Collapse
|
32
|
A Novel Attention-Mechanism Based Cox Survival Model by Exploiting Pan-Cancer Empirical Genomic Information. Cells 2022; 11:cells11091421. [PMID: 35563727 PMCID: PMC9100007 DOI: 10.3390/cells11091421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2022] [Revised: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 01/27/2023] Open
Abstract
Cancer prognosis is an essential goal for early diagnosis, biomarker selection, and medical therapy. In the past decade, deep learning has successfully solved a variety of biomedical problems. However, due to the high dimensional limitation of human cancer transcriptome data and the small number of training samples, there is still no mature deep learning-based survival analysis model that can completely solve problems in the training process like overfitting and accurate prognosis. Given these problems, we introduced a novel framework called SAVAE-Cox for survival analysis of high-dimensional transcriptome data. This model adopts a novel attention mechanism and takes full advantage of the adversarial transfer learning strategy. We trained the model on 16 types of TCGA cancer RNA-seq data sets. Experiments show that our module outperformed state-of-the-art survival analysis models such as the Cox proportional hazard model (Cox-ph), Cox-lasso, Cox-ridge, Cox-nnet, and VAECox on the concordance index. In addition, we carry out some feature analysis experiments. Based on the experimental results, we concluded that our model is helpful for revealing cancer-related genes and biological functions.
Collapse
|
33
|
Hejase HA, Mo Z, Campagna L, Siepel A. A Deep-Learning Approach for Inference of Selective Sweeps from the Ancestral Recombination Graph. Mol Biol Evol 2022; 39:msab332. [PMID: 34888675 PMCID: PMC8789311 DOI: 10.1093/molbev/msab332] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Detecting signals of selection from genomic data is a central problem in population genetics. Coupling the rich information in the ancestral recombination graph (ARG) with a powerful and scalable deep-learning framework, we developed a novel method to detect and quantify positive selection: Selection Inference using the Ancestral recombination graph (SIA). Built on a Long Short-Term Memory (LSTM) architecture, a particular type of a Recurrent Neural Network (RNN), SIA can be trained to explicitly infer a full range of selection coefficients, as well as the allele frequency trajectory and time of selection onset. We benchmarked SIA extensively on simulations under a European human demographic model, and found that it performs as well or better as some of the best available methods, including state-of-the-art machine-learning and ARG-based methods. In addition, we used SIA to estimate selection coefficients at several loci associated with human phenotypes of interest. SIA detected novel signals of selection particular to the European (CEU) population at the MC1R and ABCC11 loci. In addition, it recapitulated signals of selection at the LCT locus and several pigmentation-related genes. Finally, we reanalyzed polymorphism data of a collection of recently radiated southern capuchino seedeater taxa in the genus Sporophila to quantify the strength of selection and improved the power of our previous methods to detect partial soft sweeps. Overall, SIA uses deep learning to leverage the ARG and thereby provides new insight into how selective sweeps shape genomic diversity.
Collapse
Affiliation(s)
- Hussein A Hejase
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Leonardo Campagna
- Fuller Evolutionary Biology Program, Cornell Lab of Ornithology, Ithaca, NY, USA
- Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| |
Collapse
|
34
|
Nguembang Fadja A, Riguzzi F, Bertorelle G, Trucchi E. Identification of natural selection in genomic data with deep convolutional neural network. BioData Min 2021; 14:51. [PMID: 34863217 PMCID: PMC8642854 DOI: 10.1186/s13040-021-00280-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2021] [Accepted: 10/25/2021] [Indexed: 11/10/2022] Open
Abstract
Background With the increase in the size of genomic datasets describing variability in populations, extracting relevant information becomes increasingly useful as well as complex. Recently, computational methodologies such as Supervised Machine Learning and specifically Convolutional Neural Networks have been proposed to make inferences on demographic and adaptive processes using genomic data. Even though it was already shown to be powerful and efficient in different fields of investigation, Supervised Machine Learning has still to be explored as to unfold its enormous potential in evolutionary genomics. Results The paper proposes a method based on Supervised Machine Learning for classifying genomic data, represented as windows of genomic sequences from a sample of individuals belonging to the same population. A Convolutional Neural Network is used to test whether a genomic window shows the signature of natural selection. Training performed on simulated data show that the proposed model can accurately predict neutral and selection processes on portions of genomes taken from real populations with almost 90% accuracy.
Collapse
Affiliation(s)
- Arnaud Nguembang Fadja
- Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy.
| | - Fabrizio Riguzzi
- Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy
| | - Giorgio Bertorelle
- Dipartimento di Scienze della Vita e Biotecnologie, University of Ferrara, Via Luigi Borsari 46, Ferrara, I-44121, Italy
| | - Emiliano Trucchi
- Dipartimento di Scienze della Vita e dell'Ambiente, Marche Polytechnic University, Via Brecce Bianche, Ancona, I-60131, Italy
| |
Collapse
|
35
|
Montinaro F, Pankratov V, Yelmen B, Pagani L, Mondal M. Revisiting the out of Africa event with a deep-learning approach. Am J Hum Genet 2021; 108:2037-2051. [PMID: 34626535 PMCID: PMC8595897 DOI: 10.1016/j.ajhg.2021.09.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Accepted: 09/09/2021] [Indexed: 10/20/2022] Open
Abstract
Anatomically modern humans evolved around 300 thousand years ago in Africa. They started to appear in the fossil record outside of Africa as early as 100 thousand years ago, although other hominins existed throughout Eurasia much earlier. Recently, several studies argued in favor of a single out of Africa event for modern humans on the basis of whole-genome sequence analyses. However, the single out of Africa model is in contrast with some of the findings from fossil records, which support two out of Africa events, and uniparental data, which propose a back to Africa movement. Here, we used a deep-learning approach coupled with approximate Bayesian computation and sequential Monte Carlo to revisit these hypotheses from the whole-genome sequence perspective. Our results support the back to Africa model over other alternatives. We estimated that there are two sequential separations between Africa and out of African populations happening around 60-90 thousand years ago and separated by 13-15 thousand years. One of the populations resulting from the more recent split has replaced the older West African population to a large extent, while the other one has founded the out of Africa populations.
Collapse
Affiliation(s)
- Francesco Montinaro
- Institute of Genomics, University of Tartu, Tartu 51010, Estonia; Department of Biology-Genetics, University of Bari, Bari 70124, Italy
| | - Vasili Pankratov
- Institute of Genomics, University of Tartu, Tartu 51010, Estonia
| | - Burak Yelmen
- Institute of Genomics, University of Tartu, Tartu 51010, Estonia; Institute of Molecular and Cell Biology, University of Tartu, Tartu 51010, Estonia; Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
| | - Luca Pagani
- Institute of Genomics, University of Tartu, Tartu 51010, Estonia; Department of Biology, University of Padova, Padova 35121, Italy
| | - Mayukh Mondal
- Institute of Genomics, University of Tartu, Tartu 51010, Estonia.
| |
Collapse
|
36
|
Vaid A, Johnson KW, Badgeley MA, Somani SS, Bicak M, Landi I, Russak A, Zhao S, Levin MA, Freeman RS, Charney AW, Kukar A, Kim B, Danilov T, Lerakis S, Argulian E, Narula J, Nadkarni GN, Glicksberg BS. Using Deep-Learning Algorithms to Simultaneously Identify Right and Left Ventricular Dysfunction From the Electrocardiogram. JACC Cardiovasc Imaging 2021; 15:395-410. [PMID: 34656465 PMCID: PMC8917975 DOI: 10.1016/j.jcmg.2021.08.004] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Revised: 07/29/2021] [Accepted: 08/05/2021] [Indexed: 12/20/2022]
Abstract
OBJECTIVES This study sought to develop DL models capable of comprehensively quantifying left and right ventricular dysfunction from ECG data in a large, diverse population. BACKGROUND Rapid evaluation of left and right ventricular function using deep learning (DL) on electrocardiograms (ECGs) can assist diagnostic workflow. However, DL tools to estimate right- ventricular (RV) function do not exist, whereas those to estimate left ventricular (LV) function are restricted to quantification of very low LV function only. METHODS A multicenter study was conducted with data from 5 New York City hospitals: 4 for internal testing and 1 serving as external validation. We created novel DL models to classify left ventricular ejection fraction (LVEF) into categories derived from the latest universal definition of heart failure, estimate LVEF through regression, and predict a composite outcome of either RV systolic dysfunction or RV dilation. RESULTS We obtained echocardiogram LVEF estimates for 147,636 patients paired to 715,890 ECGs. We used natural language processing (NLP) to extract RV size and systolic function information from 404,502 echocardiogram reports paired to 761,510 ECGs for 148,227 patients. For LVEF classification in internal testing, area under curve (AUC) at detection of LVEF ≤40%, 40% < LVEF ≤50%, and LVEF >50% was 0.94 (95% CI: 0.94-0.94), 0.82 (95% CI: 0.81-0.83), and 0.89 (95% CI: 0.89-0.89), respectively. For external validation, these results were 0.94 (95% CI: 0.94-0.95), 0.73 (95% CI: 0.72-0.74), and 0.87 (95% CI: 0.87-0.88). For regression, the mean absolute error was 5.84% (95% CI: 5.82%-5.85%) for internal testing and 6.14% (95% CI: 6.13%-6.16%) in external validation. For prediction of the composite RV outcome, AUC was 0.84 (95% CI: 0.84-0.84) in both internal testing and external validation. CONCLUSIONS DL on ECG data can be used to create inexpensive screening, diagnostic, and predictive tools for both LV and RV dysfunction. Such tools may bridge the applicability of ECGs and echocardiography and enable prioritization of patients for further interventions for either sided failure progressing to biventricular disease.
Collapse
Affiliation(s)
- Akhil Vaid
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Kipp W Johnson
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | | | - Sulaiman S Somani
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Mesude Bicak
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Isotta Landi
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Adam Russak
- Department of Anesthesiology, Perioperative and Pain Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Shan Zhao
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Matthew A Levin
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Robert S Freeman
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Alexander W Charney
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Atul Kukar
- Department of Cardiology, Mount Sinai Queens Hospital, Astoria, New York, USA, and Icahn School of Medicine at Mount Sinai, New York, New York, USA; Division of Cardiology, Mount Sinai West Hospital and Icahn School of Medicine at Mount Sinai, New York, New York USA
| | - Bette Kim
- Mount Sinai Beth Israel Hospital, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Tatyana Danilov
- Department of Cardiology, Mount Sinai Morningside Hospital, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Stamatios Lerakis
- The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Edgar Argulian
- Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Jagat Narula
- The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Girish N Nadkarni
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Benjamin S Glicksberg
- Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA; The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA.
| |
Collapse
|
37
|
Gower G, Picazo PI, Fumagalli M, Racimo F. Detecting adaptive introgression in human evolution using convolutional neural networks. eLife 2021; 10:64669. [PMID: 34032215 PMCID: PMC8192126 DOI: 10.7554/elife.64669] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2020] [Accepted: 05/24/2021] [Indexed: 01/10/2023] Open
Abstract
Studies in a variety of species have shown evidence for positively selected variants introduced into a population via introgression from another, distantly related population—a process known as adaptive introgression. However, there are few explicit frameworks for jointly modelling introgression and positive selection, in order to detect these variants using genomic sequence data. Here, we develop an approach based on convolutional neural networks (CNNs). CNNs do not require the specification of an analytical model of allele frequency dynamics and have outperformed alternative methods for classification and parameter estimation tasks in various areas of population genetics. Thus, they are potentially well suited to the identification of adaptive introgression. Using simulations, we trained CNNs on genotype matrices derived from genomes sampled from the donor population, the recipient population and a related non-introgressed population, in order to distinguish regions of the genome evolving under adaptive introgression from those evolving neutrally or experiencing selective sweeps. Our CNN architecture exhibits 95% accuracy on simulated data, even when the genomes are unphased, and accuracy decreases only moderately in the presence of heterosis. As a proof of concept, we applied our trained CNNs to human genomic datasets—both phased and unphased—to detect candidates for adaptive introgression that shaped our evolutionary history.
Collapse
Affiliation(s)
- Graham Gower
- Lundbeck GeoGenetics Centre, Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Pablo Iáñez Picazo
- Lundbeck GeoGenetics Centre, Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Matteo Fumagalli
- Department of Life Sciences, Silwood Park Campus, Imperial College London, London, United Kingdom
| | - Fernando Racimo
- Lundbeck GeoGenetics Centre, Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| |
Collapse
|
38
|
Fonseca EM, Colli GR, Werneck FP, Carstens BC. Phylogeographic model selection using convolutional neural networks. Mol Ecol Resour 2021; 21:2661-2675. [PMID: 33973350 DOI: 10.1111/1755-0998.13427] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2020] [Revised: 04/02/2021] [Accepted: 04/28/2021] [Indexed: 11/26/2022]
Abstract
The discipline of phylogeography has evolved rapidly in terms of the analytical toolkit used to analyse large genomic data sets. Despite substantial advances, analytical tools that could potentially address the challenges posed by increased model complexity have not been fully explored. For example, deep learning techniques are underutilized for phylogeographic model selection. In non-model organisms, the lack of information about their ecology and evolution can lead to uncertainty about which demographic models are appropriate. Here, we assess the utility of convolutional neural networks (CNNs) for assessing demographic models in South American lizards in the genus Norops. Three demographic scenarios (constant, expansion, and bottleneck) were considered for each of four inferred population-level lineages, and we found that the overall model accuracy was higher than 98% for all lineages. We then evaluated a set of 26 models that accounted for evolutionary relationships, gene flow, and changes in effective population size among the four lineages, identifying a single model with an estimated overall accuracy of 87% when using CNNs. The inferred demography of the lizard system suggests that gene flow between non-sister populations and changes in effective population sizes through time, probably in response to Pleistocene climatic oscillations, have shaped genetic diversity in this system. Approximate Bayesian computation (ABC) was applied to provide a comparison to the performance of CNNs. ABC was unable to identify a single model among the larger set of 26 models in the subsequent analysis. Our results demonstrate that CNNs can be easily and usefully incorporated into the phylogeographer's toolkit.
Collapse
Affiliation(s)
- Emanuel M Fonseca
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| | - Guarino R Colli
- Departamento de Zoologia, Universidade de Brasília, Brasília, Brazil
| | - Fernanda P Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisas da Amazônia (INPA), Manaus, Brazil
| | - Bryan C Carstens
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
39
|
Isildak U, Stella A, Fumagalli M. Distinguishing between recent balancing selection and incomplete sweep using deep neural networks. Mol Ecol Resour 2021; 21:2706-2718. [PMID: 33749134 DOI: 10.1111/1755-0998.13379] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2020] [Revised: 03/01/2021] [Accepted: 03/05/2021] [Indexed: 12/12/2022]
Abstract
Balancing selection is an important adaptive mechanism underpinning a wide range of phenotypes. Despite its relevance, the detection of recent balancing selection from genomic data is challenging as its signatures are qualitatively similar to those left by ongoing positive selection. In this study, we developed and implemented two deep neural networks and tested their performance to predict loci under recent selection, either due to balancing selection or incomplete sweep, from population genomic data. Specifically, we generated forward-in-time simulations to train and test an artificial neural network (ANN) and a convolutional neural network (CNN). ANN received as input multiple summary statistics calculated on the locus of interest, while CNN was applied directly on the matrix of haplotypes. We found that both architectures have high accuracy to identify loci under recent selection. CNN generally outperformed ANN to distinguish between signals of balancing selection and incomplete sweep and was less affected by incorrect training data. We deployed both trained networks on neutral genomic regions in European populations and demonstrated a lower false-positive rate for CNN than ANN. We finally deployed CNN within the MEFV gene region and identified several common variants predicted to be under incomplete sweep in a European population. Notably, two of these variants are functional changes and could modulate susceptibility to familial Mediterranean fever, possibly as a consequence of past adaptation to pathogens. In conclusion, deep neural networks were able to characterize signals of selection on intermediate frequency variants, an analysis currently inaccessible by commonly used strategies.
Collapse
Affiliation(s)
- Ulas Isildak
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Alessandro Stella
- Laboratory of Medical Genetics, Department of Biomedical Sciences and Human Oncology, Università degli Studi di Bari Aldo Moro, Bari, Italy
| | - Matteo Fumagalli
- Department of Life Sciences, Silwood Park Campus, Imperial College London, London, UK
| |
Collapse
|
40
|
Wang Z, Wang J, Kourakos M, Hoang N, Lee HH, Mathieson I, Mathieson S. Automatic inference of demographic parameters using generative adversarial networks. Mol Ecol Resour 2021; 21:2689-2705. [PMID: 33745225 PMCID: PMC8596911 DOI: 10.1111/1755-0998.13386] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2020] [Accepted: 03/05/2021] [Indexed: 12/12/2022]
Abstract
Population genetics relies heavily on simulated data for validation, inference and intuition. In particular, since the evolutionary ‘ground truth’ for real data is always limited, simulated data are crucial for training supervised machine learning methods. Simulation software can accurately model evolutionary processes but requires many hand‐selected input parameters. As a result, simulated data often fail to mirror the properties of real genetic data, which limits the scope of methods that rely on it. Here, we develop a novel approach to estimating parameters in population genetic models that automatically adapts to data from any population. Our method, pg‐gan, is based on a generative adversarial network that gradually learns to generate realistic synthetic data. We demonstrate that our method is able to recover input parameters in a simulated isolation‐with‐migration model. We then apply our method to human data from the 1000 Genomes Project and show that we can accurately recapitulate the features of real data.
Collapse
Affiliation(s)
- Zhanpeng Wang
- Department of Computer Science, Haverford College, Haverford, PA, USA
| | - Jiaping Wang
- Department of Computer Science, Haverford College, Haverford, PA, USA
| | - Michael Kourakos
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Nhung Hoang
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Hyong Hark Lee
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Iain Mathieson
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sara Mathieson
- Department of Computer Science, Haverford College, Haverford, PA, USA
| |
Collapse
|
41
|
Lindo J, DeGiorgio M. Understanding the Adaptive Evolutionary Histories of South American Ancient and Present-Day Populations via Genomics. Genes (Basel) 2021; 12:360. [PMID: 33801556 PMCID: PMC8001801 DOI: 10.3390/genes12030360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2021] [Revised: 02/18/2021] [Accepted: 02/22/2021] [Indexed: 12/03/2022] Open
Abstract
The South American continent is remarkably diverse in its ecological zones, spanning the Amazon rainforest, the high-altitude Andes, and Tierra del Fuego. Yet the original human populations of the continent successfully inhabited all these zones, well before the buffering effects of modern technology. Therefore, it is likely that the various cultures were successful, in part, due to positive natural selection that allowed them to successfully establish populations for thousands of years. Detecting positive selection in these populations is still in its infancy, as the ongoing effects of European contact have decimated many of these populations and introduced gene flow from outside of the continent. In this review, we explore hypotheses of possible human biological adaptation, methods to identify positive selection, the utilization of ancient DNA, and the integration of modern genomes through the identification of genomic tracts that reflect the ancestry of the first populations of the Americas.
Collapse
Affiliation(s)
- John Lindo
- Department of Anthropology, Emory University, Atlanta, GA 30322, USA
| | - Michael DeGiorgio
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
42
|
Del Amparo R, Branco C, Arenas J, Vicens A, Arenas M. Analysis of selection in protein-coding sequences accounting for common biases. Brief Bioinform 2021; 22:6105943. [PMID: 33479739 DOI: 10.1093/bib/bbaa431] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2020] [Revised: 12/17/2020] [Accepted: 12/22/2020] [Indexed: 12/16/2022] Open
Abstract
The evolution of protein-coding genes is usually driven by selective processes, which favor some evolutionary trajectories over others, optimizing the subsequent protein stability and activity. The analysis of selection in this type of genetic data is broadly performed with the metric nonsynonymous/synonymous substitution rate ratio (dN/dS). However, most of the well-established methodologies to estimate this metric make crucial assumptions, such as lack of recombination or invariable codon frequencies along genes, which can bias the estimation. Here, we review the most relevant biases in the dN/dS estimation and provide a detailed guide to estimate this metric using state-of-the-art procedures that account for such biases, along with illustrative practical examples and recommendations. We also discuss the traditional interpretation of the estimated dN/dS emphasizing the importance of considering complementary biological information such as the role of the observed substitutions on the stability and function of proteins. This review is oriented to help evolutionary biologists that aim to accurately estimate selection in protein-coding sequences.
Collapse
Affiliation(s)
- Roberto Del Amparo
- CINBIO (Biomedical Research Center), University of Vigo, 36310 Vigo, Spain.,Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
| | - Catarina Branco
- CINBIO (Biomedical Research Center), University of Vigo, 36310 Vigo, Spain.,Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
| | - Jesús Arenas
- Unit of Microbiology and Immunology, University of Zaragoza, 50013 Zaragoza, Spain
| | - Alberto Vicens
- CINBIO (Biomedical Research Center), University of Vigo, 36310 Vigo, Spain.,Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
| | - Miguel Arenas
- CINBIO (Biomedical Research Center), University of Vigo, 36310 Vigo, Spain.,Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
| |
Collapse
|
43
|
Hübner S, Kantar MB. Tapping Diversity From the Wild: From Sampling to Implementation. FRONTIERS IN PLANT SCIENCE 2021; 12:626565. [PMID: 33584776 PMCID: PMC7873362 DOI: 10.3389/fpls.2021.626565] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/06/2020] [Accepted: 01/07/2021] [Indexed: 05/05/2023]
Abstract
The diversity observed among crop wild relatives (CWRs) and their ability to flourish in unfavorable and harsh environments have drawn the attention of plant scientists and breeders for many decades. However, it is also recognized that the benefit gained from using CWRs in breeding is a potential rose between thorns of detrimental genetic variation that is linked to the trait of interest. Despite the increased interest in CWRs, little attention was given so far to the statistical, analytical, and technical considerations that should guide the sampling design, the germplasm characterization, and later its implementation in breeding. Here, we review the entire process of sampling and identifying beneficial genetic variation in CWRs and the challenge of using it in breeding. The ability to detect beneficial genetic variation in CWRs is strongly affected by the sampling design which should be adjusted to the spatial and temporal variation of the target species, the trait of interest, and the analytical approach used. Moreover, linkage disequilibrium is a key factor that constrains the resolution of searching for beneficial alleles along the genome, and later, the ability to deplete linked deleterious genetic variation as a consequence of genetic drag. We also discuss how technological advances in genomics, phenomics, biotechnology, and data science can improve the ability to identify beneficial genetic variation in CWRs and to exploit it in strive for higher-yielding and sustainable crops.
Collapse
Affiliation(s)
- Sariel Hübner
- Galilee Research Institute (MIGAL), Tel-Hai College, Qiryat Shemona, Israel
- *Correspondence: Sariel Hübner,
| | - Michael B. Kantar
- Department of Tropical Plant and Soil Sciences, University of Hawai’i at Mânoa, Honolulu, HI, United States
| |
Collapse
|
44
|
Sanchez T, Cury J, Charpiat G, Jay F. Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation. Mol Ecol Resour 2020; 21:2645-2660. [DOI: 10.1111/1755-0998.13224] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 06/19/2020] [Accepted: 07/02/2020] [Indexed: 12/28/2022]
Affiliation(s)
- Théophile Sanchez
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Jean Cury
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Guillaume Charpiat
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Flora Jay
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| |
Collapse
|
45
|
Rees JS, Castellano S, Andrés AM. The Genomics of Human Local Adaptation. Trends Genet 2020; 36:415-428. [DOI: 10.1016/j.tig.2020.03.006] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2020] [Revised: 03/16/2020] [Accepted: 03/18/2020] [Indexed: 01/23/2023]
|
46
|
Abstract
Accurately inferring the genome-wide landscape of recombination rates in natural populations is a central aim in genomics, as patterns of linkage influence everything from genetic mapping to understanding evolutionary history. Here, we describe recombination landscape estimation using recurrent neural networks (ReLERNN), a deep learning method for estimating a genome-wide recombination map that is accurate even with small numbers of pooled or individually sequenced genomes. Rather than use summaries of linkage disequilibrium as its input, ReLERNN takes columns from a genotype alignment, which are then modeled as a sequence across the genome using a recurrent neural network. We demonstrate that ReLERNN improves accuracy and reduces bias relative to existing methods and maintains high accuracy in the face of demographic model misspecification, missing genotype calls, and genome inaccessibility. We apply ReLERNN to natural populations of African Drosophila melanogaster and show that genome-wide recombination landscapes, although largely correlated among populations, exhibit important population-specific differences. Lastly, we connect the inferred patterns of recombination with the frequencies of major inversions segregating in natural Drosophila populations.
Collapse
Affiliation(s)
- Jeffrey R Adrion
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| | - Jared G Galloway
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| | - Andrew D Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| |
Collapse
|
47
|
Dehasque M, Ávila‐Arcos MC, Díez‐del‐Molino D, Fumagalli M, Guschanski K, Lorenzen ED, Malaspinas A, Marques‐Bonet T, Martin MD, Murray GGR, Papadopulos AST, Therkildsen NO, Wegmann D, Dalén L, Foote AD. Inference of natural selection from ancient DNA. Evol Lett 2020; 4:94-108. [PMID: 32313686 PMCID: PMC7156104 DOI: 10.1002/evl3.165] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2019] [Revised: 01/13/2020] [Accepted: 02/02/2020] [Indexed: 01/01/2023] Open
Abstract
Evolutionary processes, including selection, can be indirectly inferred based on patterns of genomic variation among contemporary populations or species. However, this often requires unrealistic assumptions of ancestral demography and selective regimes. Sequencing ancient DNA from temporally spaced samples can inform about past selection processes, as time series data allow direct quantification of population parameters collected before, during, and after genetic changes driven by selection. In this Comment and Opinion, we advocate for the inclusion of temporal sampling and the generation of paleogenomic datasets in evolutionary biology, and highlight some of the recent advances that have yet to be broadly applied by evolutionary biologists. In doing so, we consider the expected signatures of balancing, purifying, and positive selection in time series data, and detail how this can advance our understanding of the chronology and tempo of genomic change driven by selection. However, we also recognize the limitations of such data, which can suffer from postmortem damage, fragmentation, low coverage, and typically low sample size. We therefore highlight the many assumptions and considerations associated with analyzing paleogenomic data and the assumptions associated with analytical methods.
Collapse
Affiliation(s)
- Marianne Dehasque
- Centre for Palaeogenetics10691StockholmSweden
- Department of Bioinformatics and GeneticsSwedish Museum of Natural History10405StockholmSweden
- Department of ZoologyStockholm University10691StockholmSweden
| | - María C. Ávila‐Arcos
- International Laboratory for Human Genome Research (LIIGH)UNAM JuriquillaQueretaro76230Mexico
| | - David Díez‐del‐Molino
- Centre for Palaeogenetics10691StockholmSweden
- Department of ZoologyStockholm University10691StockholmSweden
| | - Matteo Fumagalli
- Department of Life Sciences, Silwood Park CampusImperial College LondonAscotSL5 7PYUnited Kingdom
| | - Katerina Guschanski
- Animal Ecology, Department of Ecology and Genetics, Science for Life LaboratoryUppsala University75236UppsalaSweden
| | | | - Anna‐Sapfo Malaspinas
- Department of Computational BiologyUniversity of Lausanne1015LausanneSwitzerland
- SIB Swiss Institute of Bioinformatics1015LausanneSwitzerland
| | - Tomas Marques‐Bonet
- Institut de Biologia Evolutiva(CSIC‐Universitat Pompeu Fabra), Parc de Recerca Biomèdica de BarcelonaBarcelonaSpain
- National Centre for Genomic Analysis—Centre for Genomic RegulationBarcelona Institute of Science and Technology08028BarcelonaSpain
- Institucio Catalana de Recerca i Estudis Avançats08010BarcelonaSpain
- Institut Català de Paleontologia Miquel CrusafontUniversitat Autònoma de BarcelonaCerdanyola del VallèsSpain
| | - Michael D. Martin
- Department of Natural History, NTNU University MuseumNorwegian University of Science and Technology (NTNU)TrondheimNorway
| | - Gemma G. R. Murray
- Department of Veterinary MedicineUniversity of CambridgeCambridgeCB2 1TNUnited Kingdom
| | - Alexander S. T. Papadopulos
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological SciencesBangor UniversityBangorLL57 2UWUnited Kingdom
| | | | - Daniel Wegmann
- Department of BiologyUniversité de Fribourg1700FribourgSwitzerland
- Swiss Institute of BioinformaticsFribourgSwitzerland
| | - Love Dalén
- Centre for Palaeogenetics10691StockholmSweden
- Department of Bioinformatics and GeneticsSwedish Museum of Natural History10405StockholmSweden
| | - Andrew D. Foote
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological SciencesBangor UniversityBangorLL57 2UWUnited Kingdom
| |
Collapse
|