1
Katikaneni A, Lowe CB. Novelty versus innovation of gene regulatory elements in human evolution and disease. Curr Opin Genet Dev 2024; 90:102279. [PMID: 39591813 DOI: 10.1016/j.gde.2024.102279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 10/10/2024] [Accepted: 10/22/2024] [Indexed: 11/28/2024]
Abstract
It is not currently understood how much of human evolution is due to modifying existing functional elements in the genome versus forging novel elements from nonfunctional DNA. Many early experiments that aimed to assign genetic changes on the human lineage to their resulting phenotypic change have focused on mutations that modify existing elements. However, a number of recent studies have highlighted the potential ease and importance of forging novel gene regulatory elements from nonfunctional sequences on the human lineage. In this review, we distinguish gene regulatory element novelty from innovation. We propose definitions for these terms and emphasize their importance in studying the genetic basis of human uniqueness. We discuss why the forging of novel regulatory elements may have been less emphasized during the previous decades, and why novel regulatory elements are likely to play a significant role in both human adaptation and disease.
Affiliation(s)
- Anushka Katikaneni
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA; University Program in Genetics and Genomics, Duke University, Durham, NC 27708, USA
- Craig B Lowe
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA; University Program in Genetics and Genomics, Duke University, Durham, NC 27708, USA
2
Perron N, Kirst M, Chen S. Bringing CAM photosynthesis to the table: Paving the way for resilient and productive agricultural systems in a changing climate. Plant Communications 2024; 5:100772. [PMID: 37990498 PMCID: PMC10943566 DOI: 10.1016/j.xplc.2023.100772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 07/27/2023] [Accepted: 11/20/2023] [Indexed: 11/23/2023]
Abstract
Modern agricultural systems are directly threatened by global climate change and the resulting freshwater crisis. A considerable challenge in the coming years will be to develop crops that can cope with the consequences of declining freshwater resources and changing temperatures. One approach to meeting this challenge may lie in our understanding of plant photosynthetic adaptations and water use efficiency. Plants from various taxa have evolved crassulacean acid metabolism (CAM), a water-conserving adaptation of photosynthetic carbon dioxide fixation that enables plants to thrive under semi-arid or seasonally drought-prone conditions. Although past research on CAM has led to a better understanding of the inner workings of plant resilience and adaptation to stress, successful introduction of this pathway into C3 or C4 plants has not been reported. The recent revolution in molecular, systems, and synthetic biology, as well as innovations in high-throughput data generation and mining, creates new opportunities to uncover the minimum genetic tool kit required to introduce CAM traits into drought-sensitive crops. Here, we propose four complementary research avenues to uncover this tool kit. First, genomes and computational methods should be used to improve understanding of the nature of variations that drive CAM evolution. Second, single-cell 'omics technologies offer the possibility for in-depth characterization of the mechanisms that trigger environmentally controlled CAM induction. Third, the rapid increase in new 'omics data enables a comprehensive, multimodal exploration of CAM. Finally, the expansion of functional genomics methods is paving the way for integration of CAM into farming systems.
Affiliation(s)
- Noé Perron
  - Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32608, USA
- Matias Kirst
  - Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32608, USA; School of Forest, Fisheries and Geomatics Sciences, University of Florida, Gainesville, FL 32603, USA
- Sixue Chen
  - Department of Biology, University of Mississippi, Oxford, MS 38677-1848, USA
3
Booker WW, Ray DD, Schrider DR. This population does not exist: learning the distribution of evolutionary histories with generative adversarial networks. Genetics 2023; 224:iyad063. [PMID: 37067864 PMCID: PMC10213497 DOI: 10.1093/genetics/iyad063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 02/23/2023] [Accepted: 04/05/2023] [Indexed: 04/18/2023]
Abstract
Numerous studies over the last decade have demonstrated the utility of machine learning methods when applied to population genetic tasks. More recent studies show the potential of deep-learning methods in particular, which allow researchers to approach problems without making prior assumptions about how the data should be summarized or manipulated, instead learning their own internal representation of the data in an attempt to maximize inferential accuracy. One type of deep neural network, called Generative Adversarial Networks (GANs), can even be used to generate new data, and this approach has been used to create individual artificial human genomes free from privacy concerns. In this study, we further explore the application of GANs in population genetics by designing and training a network to learn the statistical distribution of population genetic alignments (i.e. data sets consisting of sequences from an entire population sample) under several diverse evolutionary histories, the first GAN capable of performing this task. After testing multiple different neural network architectures, we report the results of a fully differentiable Deep-Convolutional Wasserstein GAN with gradient penalty that is capable of generating artificial examples of population genetic alignments that successfully mimic key aspects of the training data, including the site-frequency spectrum, differentiation between populations, and patterns of linkage disequilibrium. We demonstrate consistent training success across various evolutionary models, including models of panmictic and subdivided populations, populations at equilibrium and experiencing changes in size, and populations experiencing either no selection or positive selection of various strengths, all without the need for extensive hyperparameter tuning. Overall, our findings highlight the ability of GANs to learn and mimic population genetic data and suggest future areas where this work can be applied in population genetics research that we discuss herein.
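The abstract names the site-frequency spectrum as one property on which generated alignments are compared with training data. As a rough illustration only (not the authors' code), the sketch below computes an unfolded spectrum from a toy 0/1 haplotype matrix. The function name `site_frequency_spectrum` and the toy data are assumptions made for this example.

```python
from collections import Counter

def site_frequency_spectrum(alignment):
    """Unfolded SFS of a 0/1 haplotype matrix.

    Rows are sampled sequences, columns are sites, and 1 marks the derived allele.
    Entry i-1 of the result is the number of sites whose derived allele
    appears in exactly i of the n sequences, for i = 1..n-1.
    """
    n = len(alignment)
    derived_counts = Counter(sum(column) for column in zip(*alignment))
    return [derived_counts.get(i, 0) for i in range(1, n)]

# Hypothetical toy alignment: 4 sequences, 5 sites.
toy_alignment = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
sfs = site_frequency_spectrum(toy_alignment)
print(sfs)
```

Computing the same spectrum on real and on generated alignments gives one simple way to compare the two sets.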
Affiliation(s)
- William W Booker
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
- Dylan D Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
4
Hayeck TJ, Stong N, Baugh E, Dhindsa R, Turner TN, Malakar A, Mosbruger TL, Shaw GTW, Duan Y, Ionita-Laza I, Goldstein D, Allen AS. Ancestry adjustment improves genome-wide estimates of regional intolerance. Genetics 2022; 221:iyac050. [PMID: 35385101 PMCID: PMC9157129 DOI: 10.1093/genetics/iyac050] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/16/2022] [Accepted: 02/24/2022] [Indexed: 11/12/2022]
Abstract
Genomic regions subject to purifying selection are more likely to carry disease-causing mutations than regions not under selection. Cross species conservation is often used to identify such regions but with limited resolution to detect selection on short evolutionary timescales such as that occurring in only one species. In contrast, genetic intolerance looks for depletion of variation relative to expectation within a species, allowing species-specific features to be identified. When estimating the intolerance of noncoding sequence, methods strongly leverage variant frequency distributions. As the expected distributions depend on ancestry, if not properly controlled for, ancestral population source may obfuscate signals of selection. We demonstrate that properly incorporating ancestry in intolerance estimation greatly improved variant classification. We provide a genome-wide intolerance map that is conditional on ancestry and likely to be particularly valuable for variant prioritization.
Affiliation(s)
- Tristan J Hayeck
  - Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Nicholas Stong
  - Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Evan Baugh
  - Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Ryan Dhindsa
  - Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Tychele N Turner
  - Department of Genetics, Washington University in St. Louis, St. Louis, MO 63110, USA
- Ayan Malakar
  - Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Timothy L Mosbruger
  - Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Grace Tzun-Wen Shaw
  - Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yuncheng Duan
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
- David Goldstein
  - Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Andrew S Allen
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
5
Kumar H, Panigrahi M, Panwar A, Rajawat D, Nayak SS, Saravanan KA, Kaisa K, Parida S, Bhushan B, Dutt T. Machine-Learning Prospects for Detecting Selection Signatures Using Population Genomics Data. J Comput Biol 2022; 29:943-960. [PMID: 35639362 DOI: 10.1089/cmb.2021.0447] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/29/2022]
Abstract
Natural selection has been given a lot of attention because it relates to the adaptation of populations to their environments, both biotic and abiotic. An allele is selected when it is favored by natural selection. Consequently, the favored allele increases in frequency in the population and neighboring linked variation diminishes, causing so-called selective sweeps. A high-throughput genomic sequence allows one to disentangle the evolutionary forces at play in populations. With the development of high-throughput genome sequencing technologies, it has become easier to detect these selective sweeps/selection signatures. Various methods can be used to detect selective sweeps, from simple implementations using summary statistics to complex statistical approaches. One of the important problems of these statistical models is the potential to provide inaccurate results when their assumptions are violated. The use of machine learning (ML) in population genetics has been introduced as an alternative method of detecting selection by treating the problem of detecting selection signatures as a classification problem. Since the availability of population genomics data is increasing, researchers may incorporate ML into these statistical models to infer signatures of selection with higher predictive accuracy and better resolution. This article describes how ML can be used to aid in detecting and studying natural selection patterns using population genomic data.
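The abstract refers to detecting sweeps with simple summary statistics. As a hedged sketch (the function names and toy data are assumptions, not taken from the article), two classical diversity estimators can be computed from a 0/1 haplotype matrix: nucleotide diversity (mean pairwise differences) and Watterson's estimator (segregating sites scaled by a harmonic sum). A sweep is expected to reduce both in the linked region.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Mean number of pairwise differences between sampled sequences."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

def watterson_theta(haplotypes):
    """Number of segregating sites divided by a_1 = sum_{i=1}^{n-1} 1/i."""
    n = len(haplotypes)
    segregating = sum(1 for column in zip(*haplotypes) if 0 < sum(column) < n)
    a1 = sum(1 / i for i in range(1, n))
    return segregating / a1

# Hypothetical toy window: 4 sequences, 5 sites (1 = derived allele).
window = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
pi = nucleotide_diversity(window)
theta_w = watterson_theta(window)
print(pi, theta_w)
```

In a classification setting of the kind the abstract describes, values like these, computed per genomic window, could serve as input features.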
Affiliation(s)
- Harshit Kumar
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Manjit Panigrahi
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Anuradha Panwar
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Divya Rajawat
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Sonali Sonejita Nayak
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- K A Saravanan
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Kaiho Kaisa
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Subhashree Parida
  - Divisions of Pharmacology and Toxicology, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Bharat Bhushan
  - Divisions of Animal Genetics, ICAR-Indian Veterinary Research Institute, Izatnagar, India
- Triveni Dutt
  - Livestock Production and Management Section, ICAR-Indian Veterinary Research Institute, Izatnagar, India
6
Nguembang Fadja A, Riguzzi F, Bertorelle G, Trucchi E. Identification of natural selection in genomic data with deep convolutional neural network. BioData Min 2021; 14:51. [PMID: 34863217 PMCID: PMC8642854 DOI: 10.1186/s13040-021-00280-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Accepted: 10/25/2021] [Indexed: 11/10/2022]
Abstract
Background: With the increase in the size of genomic datasets describing variability in populations, extracting relevant information becomes increasingly useful as well as complex. Recently, computational methodologies such as Supervised Machine Learning and specifically Convolutional Neural Networks have been proposed to make inferences on demographic and adaptive processes using genomic data. Even though it has already been shown to be powerful and efficient in different fields of investigation, Supervised Machine Learning has yet to be fully explored in evolutionary genomics, where its potential remains largely untapped. Results: The paper proposes a method based on Supervised Machine Learning for classifying genomic data, represented as windows of genomic sequences from a sample of individuals belonging to the same population. A Convolutional Neural Network is used to test whether a genomic window shows the signature of natural selection. Training performed on simulated data shows that the proposed model can accurately predict neutral and selection processes on portions of genomes taken from real populations with almost 90% accuracy.
Affiliation(s)
- Arnaud Nguembang Fadja
  - Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy
- Fabrizio Riguzzi
  - Dipartimento di Matematica e Informatica, University of Ferrara, Via Saragat 1, Ferrara, I-44122, Italy
- Giorgio Bertorelle
  - Dipartimento di Scienze della Vita e Biotecnologie, University of Ferrara, Via Luigi Borsari 46, Ferrara, I-44121, Italy
- Emiliano Trucchi
  - Dipartimento di Scienze della Vita e dell'Ambiente, Marche Polytechnic University, Via Brecce Bianche, Ancona, I-60131, Italy
7
Chen Z, Zhang D, Reynolds RH, Gustavsson EK, García-Ruiz S, D'Sa K, Fairbrother-Browne A, Vandrovcova J, Hardy J, Houlden H, Gagliano Taliun SA, Botía J, Ryten M. Human-lineage-specific genomic elements are associated with neurodegenerative disease and APOE transcript usage. Nat Commun 2021; 12:2076. [PMID: 33824317 PMCID: PMC8024253 DOI: 10.1038/s41467-021-22262-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/28/2020] [Accepted: 03/03/2021] [Indexed: 12/12/2022]
Abstract
Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases have high CNCR density, including APOE, highlighting an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript to be more abundant in Alzheimer's disease with more severe tau and amyloid pathological burden. Thus, we demonstrate potential association of human-lineage-specific sequences in brain development and neurological disease.
Affiliation(s)
- Zhongbo Chen
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- David Zhang
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Regina H Reynolds
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Emil K Gustavsson
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Sonia García-Ruiz
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Karishma D'Sa
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Aine Fairbrother-Browne
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
- Jana Vandrovcova
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
- John Hardy
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - Reta Lila Weston Institute, Queen Square Institute of Neurology, UCL, London, UK
  - UK Dementia Research Institute, Queen Square Institute of Neurology, UCL, London, UK
  - NIHR University College London Hospitals Biomedical Research Centre, London, UK
  - Institute for Advanced Study, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Henry Houlden
  - Department of Neuromuscular Disease, Queen Square Institute of Neurology, UCL, London, UK
- Sarah A Gagliano Taliun
  - Department of Medicine & Department of Neurosciences, Université de Montréal, Montréal, QC, Canada
  - Montréal Heart Institute, Montréal, Québec, Canada
- Juan Botía
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia, Spain
- Mina Ryten
  - Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London (UCL), London, UK
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London, UK
  - Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK
8
9
Huber CD, Kim BY, Lohmueller KE. Population genetic models of GERP scores suggest pervasive turnover of constrained sites across mammalian evolution. PLoS Genet 2020; 16:e1008827. [PMID: 32469868 PMCID: PMC7286533 DOI: 10.1371/journal.pgen.1008827] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 01/27/2020] [Revised: 06/10/2020] [Accepted: 05/05/2020] [Indexed: 01/20/2023]
Abstract
Comparative genomic approaches have been used to identify sites where mutations are under purifying selection and of functional consequence by searching for sequences that are conserved across distantly related species. However, the performance of these approaches has not been rigorously evaluated under population genetic models. Further, short-lived functional elements may not leave a footprint of sequence conservation across many species. We use simulations to study how one measure of conservation, the Genomic Evolutionary Rate Profiling (GERP) score, relates to the strength of selection (Nes). We show that the GERP score is related to the strength of purifying selection. However, changes in selection coefficients or functional elements over time (i.e. functional turnover) can strongly affect the GERP distribution, leading to unexpected relationships between GERP and Nes. Further, we show that for functional elements that have a high turnover rate, adding more species to the analysis does not necessarily increase statistical power. Finally, we use the distribution of GERP scores across the human genome to compare models with and without turnover of sites where mutations are under purifying selection. We show that mutations in 4.51% of the noncoding human genome are under purifying selection and that most of this sequence has likely experienced changes in selection coefficients throughout mammalian evolution. Our work reveals limitations to using comparative genomic approaches to identify deleterious mutations. Commonly used GERP score thresholds miss over half of the noncoding sites in the human genome where mutations are under purifying selection.
One of the most significant and challenging tasks in modern genomics is to assess the functional consequences of a particular nucleotide change in a genome. A common approach to address this challenge prioritizes sequences that share similar nucleotides across distantly related species, with the rationale that mutations at such positions were deleterious and removed from the population by purifying natural selection. Our manuscript shows that one popular measure of sequence conservation, the GERP score, performs well at identifying selected mutations if mutations at a site were under selection across all of mammalian evolution. Changes in selection at a given site dramatically reduce the power of GERP to detect selected mutations in humans. We also combine population genetic models with the distribution of GERP scores at noncoding sites across the human genome to show that the degree of selection at individual sites has changed throughout mammalian evolution. Importantly, we demonstrate that at least 80 Mb of noncoding sequence under purifying selection in humans will not have extreme GERP scores and will likely be missed by modern comparative genomic approaches. Our work argues that new approaches, potentially based on genetic variation within species, will be required to identify deleterious mutations.
Affiliation(s)
- Christian D. Huber
  - School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Bernard Y. Kim
  - Department of Biology, Stanford University, Stanford, California, United States of America
- Kirk E. Lohmueller
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, United States of America
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California, United States of America
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America
10
Joly-Lopez Z, Platts AE, Gulko B, Choi JY, Groen SC, Zhong X, Siepel A, Purugganan MD. An inferred fitness consequence map of the rice genome. Nature Plants 2020; 6:119-130. [PMID: 32042156 PMCID: PMC7446671 DOI: 10.1038/s41477-019-0589-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/18/2019] [Accepted: 12/20/2019] [Indexed: 05/04/2023]
Abstract
The extent to which sequence variation impacts plant fitness is poorly understood. High-resolution maps detailing the constraint acting on the genome, especially in regulatory sites, would be beneficial as functional annotation of noncoding sequences remains sparse. Here, we present a fitness consequence (fitCons) map for rice (Oryza sativa). We inferred fitCons scores (ρ) for 246 inferred genome classes derived from nine functional genomic and epigenomic datasets, including chromatin accessibility, messenger RNA/small RNA transcription, DNA methylation, histone modifications and engaged RNA polymerase activity. These were integrated with genome-wide polymorphism and divergence data from 1,477 rice accessions and 11 reference genome sequences in the Oryzeae. We found ρ to be multimodal, with ~9% of the rice genome falling into classes where more than half of the bases would probably have a fitness consequence if mutated. Around 2% of the rice genome showed evidence of weak negative selection, frequently at candidate regulatory sites, including a novel set of 1,000 potentially active enhancer elements. This fitCons map provides perspective on the evolutionary forces associated with genome diversity, aids in genome annotation and can guide crop breeding programs.
Affiliation(s)
- Zoé Joly-Lopez
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
- Adrian E Platts
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Brad Gulko
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Jae Young Choi
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
- Simon C Groen
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
- Xuehua Zhong
  - Laboratory of Genetics and Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Michael D Purugganan
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
  - Center for Genomics and Systems Biology, NYU Abu Dhabi Research Institute, NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
11
Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 2019; 36:220-238. [PMID: 30517664 PMCID: PMC6367976 DOI: 10.1093/molbev/msy224] [Citation(s) in RCA: 105] [Impact Index Per Article: 17.5] [Indexed: 12/28/2022]
Abstract
Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.
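The abstract describes representing DNA sequence alignments as images for a CNN. The sketch below shows one plausible encoding, a binary matrix with one row per sequence and one column per biallelic site. It is an illustrative assumption, not the encoding used in the paper, and the name `alignment_to_image` is hypothetical.

```python
from collections import Counter

def alignment_to_image(sequences):
    """Encode aligned DNA strings as a 0/1 matrix (major allele = 0, minor allele = 1).

    Monomorphic and multiallelic columns are dropped, so the image width
    equals the number of biallelic sites.
    """
    columns = []
    for site in zip(*sequences):
        counts = Counter(site)
        if len(counts) != 2:
            continue  # keep only sites with exactly two alleles
        major = counts.most_common(1)[0][0]
        columns.append([0 if base == major else 1 for base in site])
    # Transpose so that each row corresponds to one sequence again.
    return [list(row) for row in zip(*columns)]

# Hypothetical toy alignment of four sequences.
toy_sequences = ["ACGTA", "ACGAA", "TCGTA", "ACCTA"]
image = alignment_to_image(toy_sequences)
for row in image:
    print(row)
```

A matrix like this can be passed to a convolutional network in the same way as a single-channel image.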
Affiliation(s)
- Lex Flagel
  - Monsanto Company, Chesterfield, MO
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Yaniv Brandvain
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC
12
|
Chan J, Perrone V, Spence JP, Jenkins PA, Mathieson S, Song YS. A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 2018; 31:8594-8605. [PMID: 33244210 PMCID: PMC7687905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
An explosion of high-throughput DNA sequencing in the past decade has led to a surge of interest in population-scale inference with whole-genome data. Recent work in population genetics has centered on designing inference methods for relatively simple model classes, and few scalable general-purpose inference techniques exist for more realistic, complex models. To achieve this, two inferential challenges need to be addressed: (1) population data are exchangeable, calling for methods that efficiently exploit the symmetries of the data, and (2) computing likelihoods is intractable as it requires integrating over a set of correlated, extremely high-dimensional latent variables. These challenges are traditionally tackled by likelihood-free methods that use scientific simulators to generate datasets and reduce them to hand-designed, permutation-invariant summary statistics, often leading to inaccurate inference. In this work, we develop an exchangeable neural network that performs summary statistic-free, likelihood-free inference. Our framework can be applied in a black-box fashion across a variety of simulation-based tasks, both within and outside biology. We demonstrate the power of our approach on the recombination hotspot testing problem, outperforming the state-of-the-art.
|
13
|
Schrider DR, Kern AD. Supervised Machine Learning for Population Genetics: A New Paradigm. Trends Genet 2018; 34:301-312. [PMID: 29331490 PMCID: PMC5905713 DOI: 10.1016/j.tig.2017.12.005] [Citation(s) in RCA: 220] [Impact Index Per Article: 31.4] [Received: 10/20/2017] [Revised: 11/29/2017] [Accepted: 12/08/2017] [Indexed: 01/21/2023]
Abstract
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
Affiliation(s)
- Daniel R Schrider
  - Department of Genetics, and Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ 08554, USA
- Andrew D Kern
  - Department of Genetics, and Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ 08554, USA
|
14
|
The human noncoding genome defined by genetic diversity. Nat Genet 2018; 50:333-337. [PMID: 29483654 DOI: 10.1038/s41588-018-0062-7] [Citation(s) in RCA: 96] [Impact Index Per Article: 13.7] [Received: 11/16/2016] [Accepted: 01/19/2018] [Indexed: 11/09/2022]
Abstract
Understanding the significance of genetic variants in the noncoding genome is emerging as the next challenge in human genomics. We used the power of 11,257 whole-genome sequences and 16,384 heptamers (7-nt motifs) to build a map of sequence constraint for the human species. This build differed substantially from traditional maps of interspecies conservation and identified regulatory elements among the most constrained regions of the genome. Using new Hi-C experimental data, we describe a strong pattern of coordination over 2 Mb where the most constrained regulatory elements associate with the most essential genes. Constrained regions of the noncoding genome are up to 52-fold enriched for known pathogenic variants as compared to unconstrained regions (21-fold when compared to the genome average). This map of sequence constraint across thousands of individuals is an asset to help interpret noncoding elements in the human genome, prioritize variants and reconsider gene units at a larger scale.
|
15
|
Johri P, Krenek S, Marinov GK, Doak TG, Berendonk TU, Lynch M. Population Genomics of Paramecium Species. Mol Biol Evol 2017; 34:1194-1216. [PMID: 28204679 DOI: 10.1093/molbev/msx074] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 12/19/2022]
Abstract
Population-genomic analyses are essential to understanding factors shaping genomic variation and lineage-specific sequence constraints. The dearth of such analyses for unicellular eukaryotes prompted us to assess genomic variation in Paramecium, one of the most well-studied ciliate genera. The Paramecium aurelia complex consists of ∼15 morphologically indistinguishable species that diverged subsequent to two rounds of whole-genome duplications (WGDs, as long as 320 MYA) and possess extremely streamlined genomes. We examine patterns of both nuclear and mitochondrial polymorphism, by sequencing whole genomes of 10-13 worldwide isolates of each of three species belonging to the P. aurelia complex: P. tetraurelia, P. biaurelia, P. sexaurelia, as well as two outgroup species that do not share the WGDs: P. caudatum and P. multimicronucleatum. An apparent absence of global geographic population structure suggests continuous or recent dispersal of Paramecium over long distances. Intergenic regions are highly constrained relative to coding sequences, especially in P. caudatum and P. multimicronucleatum that have shorter intergenic distances. Sequence diversity and divergence are reduced up to ∼100-150 bp both upstream and downstream of genes, suggesting strong constraints imposed by the presence of densely packed regulatory modules. In addition, comparison of sequence variation at non-synonymous and synonymous sites suggests similar recent selective pressures on paralogs within and orthologs across the deeply diverging species. This study presents the first genome-wide population-genomic analysis in ciliates and provides a valuable resource for future studies in evolutionary and functional genetics in Paramecium.
Affiliation(s)
- Parul Johri
  - Department of Biology, Indiana University, Bloomington, IN
- Sascha Krenek
  - Institute of Hydrobiology, Technische Universität Dresden, Dresden, Germany
- Thomas G Doak
  - Department of Biology, Indiana University, Bloomington, IN
  - National Center for Genome Analysis Support, Indiana University, Bloomington, IN
- Thomas U Berendonk
  - Institute of Hydrobiology, Technische Universität Dresden, Dresden, Germany
- Michael Lynch
  - Department of Biology, Indiana University, Bloomington, IN
|
16
|
Abstract
The degree to which adaptation in recent human evolution shapes genetic variation remains controversial. This is in part due to the limited evidence in humans for classic "hard selective sweeps", wherein a novel beneficial mutation rapidly sweeps through a population to fixation. However, positive selection may often proceed via "soft sweeps" acting on mutations already present within a population. Here, we examine recent positive selection across six human populations using a powerful machine learning approach that is sensitive to both hard and soft sweeps. We found evidence that soft sweeps are widespread and account for the vast majority of recent human adaptation. Surprisingly, our results also suggest that linked positive selection affects patterns of variation across much of the genome, and may increase the frequencies of deleterious mutations. Our results also reveal insights into the role of sexual selection, cancer risk, and central nervous system development in recent human evolution.
Affiliation(s)
- Daniel R. Schrider
  - Department of Genetics, Rutgers University, Piscataway, NJ
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ
- Andrew D. Kern
  - Department of Genetics, Rutgers University, Piscataway, NJ
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ
|
17
|
Meyer KA, Marques-Bonet T, Sestan N. Differential Gene Expression in the Human Brain Is Associated with Conserved, but Not Accelerated, Noncoding Sequences. Mol Biol Evol 2017; 34:1217-1229. [PMID: 28204568 PMCID: PMC5400397 DOI: 10.1093/molbev/msx076] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/04/2023]
Abstract
Previous studies have found that genes which are differentially expressed within the developing human brain disproportionately neighbor conserved noncoding sequences (CNSs) that have an elevated substitution rate in humans and in other species. One explanation for this general association of differential expression with accelerated CNSs is that genes with pre-existing patterns of differential expression have been preferentially targeted by species-specific regulatory changes. Here we provide support for an alternative explanation: genes that neighbor a greater number of CNSs have a higher probability of differential expression and a higher probability of neighboring a CNS with lineage-specific acceleration. Thus, neighboring an accelerated element from any species signals that a gene likely neighbors many CNSs. We extend the analyses beyond the prenatal time points considered in previous studies to demonstrate that this association persists across developmental and adult periods. Examining differential expression between non-neural tissues suggests that the relationship between the number of CNSs a gene neighbors and its differential expression status may be particularly strong for expression differences among brain regions. In addition, by considering this relationship, we highlight a recently defined set of putative human-specific gain-of-function sequences that, even after adjusting for the number of CNSs neighbored by genes, shows a positive relationship with upregulation in the brain compared with other tissues examined.
Affiliation(s)
- Kyle A. Meyer
  - Department of Neuroscience and Kavli Institute for Neuroscience, Yale School of Medicine, New Haven, CT
- Tomas Marques-Bonet
  - Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona, Spain
  - Catalan Institution of Research and Advanced Studies (ICREA), Passeig de Lluís Companys, Barcelona, Spain
  - CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
- Nenad Sestan
  - Department of Neuroscience and Kavli Institute for Neuroscience, Yale School of Medicine, New Haven, CT
  - Departments of Genetics and Psychiatry, Section of Comparative Medicine, Program in Cellular Neuroscience, Neurodegeneration and Repair, and Yale Child Study Center, Yale School of Medicine, New Haven, CT
|
18
|
Phung TN, Huber CD, Lohmueller KE. Determining the Effect of Natural Selection on Linked Neutral Divergence across Species. PLoS Genet 2016; 12:e1006199. [PMID: 27508305 PMCID: PMC4980041 DOI: 10.1371/journal.pgen.1006199] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 12/11/2015] [Accepted: 06/25/2016] [Indexed: 11/18/2022]
Abstract
A major goal in evolutionary biology is to understand how natural selection has shaped patterns of genetic variation across genomes. Studies in a variety of species have shown that neutral genetic diversity (intra-species differences) has been reduced at sites linked to those under direct selection. However, the effect of linked selection on neutral sequence divergence (inter-species differences) remains ambiguous. While empirical studies have reported correlations between divergence and recombination, which is interpreted as evidence for natural selection reducing linked neutral divergence, theory argues otherwise, especially for species that have diverged long ago. Here we address these outstanding issues by examining whether natural selection can affect divergence between both closely and distantly related species. We show that neutral divergence between closely related species (e.g. human-primate) is negatively correlated with functional content and positively correlated with human recombination rate. We also find that neutral divergence between distantly related species (e.g. human-rodent) is negatively correlated with functional content and positively correlated with estimates of background selection from primates. These patterns persist after accounting for the confounding factors of hypermutable CpG sites, GC content, and biased gene conversion. Coalescent models indicate that even when the contribution of ancestral polymorphism to divergence is small, background selection in the ancestral population can still explain a large proportion of the variance in divergence across the genome, generating the observed correlations. Our findings reveal that, contrary to previous intuition, natural selection can indirectly affect linked neutral divergence between both closely and distantly related species. Though we cannot formally exclude the possibility that the direct effects of purifying selection drive some of these patterns, such a scenario would be possible only if more of the genome is under purifying selection than currently believed. Our work has implications for understanding the evolution of genomes and interpreting patterns of genetic variation.

Genetic variation at neutral sites can be reduced through linkage to nearby selected sites. This pattern has been used to show the widespread effects of natural selection at shaping patterns of genetic diversity across genomes from a variety of species. However, it is not entirely clear whether natural selection has an effect on neutral divergence between species. Here we show that putatively neutral divergence between closely related species (human and chimp) and between distantly related pairs of species (humans and mice) show signatures consistent with having been affected by linkage to selected sites. Further, our theoretical models and simulations show that natural selection indirectly affecting linked neutral sites can generate these patterns. Unless substantially more of the genome is under the direct effects of purifying selection than currently believed, our results argue that natural selection has played an important role in shaping variation in levels of putatively neutral sequence divergence across the genome. Our findings further suggest that divergence-based estimates of neutral mutation rate variation across the genome as well as certain estimators of population history may be confounded by linkage to selected sites.
Affiliation(s)
- Tanya N. Phung
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, California, United States of America
- Christian D. Huber
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Kirk E. Lohmueller
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, California, United States of America
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, United States of America
|