201
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique for tackling many problems in data mining and predictive analytics. We believe that ML has considerable potential in bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize the major ML algorithms, detail the conditions that require attention when applying them to genomic problems, and provide a list of examples from different perspectives along with current data analysis challenges.
202
Abstract
Machine learning has demonstrated potential in analyzing large, complex biological data. In practice, however, biological information is required in addition to machine learning for successful application.
203
204
Kelly JK, Hughes KA. Pervasive Linked Selection and Intermediate-Frequency Alleles Are Implicated in an Evolve-and-Resequencing Experiment of Drosophila simulans. Genetics 2019; 211:943-961. [PMID: 30593495 PMCID: PMC6404262 DOI: 10.1534/genetics.118.301824] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 11/26/2018] [Accepted: 12/15/2018] [Indexed: 11/18/2022] Open
Abstract
We develop analytical and simulation tools for evolve-and-resequencing experiments and apply them to a new study of rapid evolution in Drosophila simulans. Likelihood test statistics applied to pooled population sequencing data suggest parallel evolution of 138 SNPs across the genome. This number is reduced by orders of magnitude from previous studies (thousands or tens of thousands), owing to differences in both experimental design and statistical analysis. Whole genome simulations calibrated from Drosophila genetic data sets indicate that major features of the genome-wide response could be explained by as few as 30 loci under strong directional selection with a corresponding hitchhiking effect. Smaller effect loci are likely also responding, but are below the detection limit of the experiment. Finally, SNPs showing strong parallel evolution in the experiment are intermediate in frequency in the natural population (usually 30-70%) indicative of balancing selection in nature. These loci also exhibit elevated differentiation among natural populations of D. simulans, suggesting environmental heterogeneity as a potential balancing mechanism.
Affiliation(s)
- John K Kelly
  - Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, Kansas 66045
- Kimberly A Hughes
  - Department of Biological Science, Florida State University, Tallahassee, Florida 32306
205
Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 2019; 36:220-238. [PMID: 30517664 PMCID: PMC6367976 DOI: 10.1093/molbev/msy224] [Citation(s) in RCA: 98] [Impact Index Per Article: 19.6] [Indexed: 12/28/2022] Open
Abstract
Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.
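The idea of representing a DNA sequence alignment as an image can be illustrated with a minimal sketch. It assumes a simple encoding that is not taken from the paper: each row is a sequence, each column is a segregating site, and a cell is 1 where the sequence carries a non-major allele. The function name `alignment_to_image` and the toy alignment are hypothetical, and the CNN itself is not shown.

```python
from collections import Counter


def alignment_to_image(alignment):
    """Encode an alignment as a 0/1 matrix.

    Rows are sequences and columns are segregating (polymorphic) sites.
    A cell is 0 if the sequence carries the most common allele at that
    site and 1 otherwise. This encoding is an illustrative assumption.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")

    columns = []
    for site in zip(*alignment):
        counts = Counter(site)
        if len(counts) < 2:
            continue  # monomorphic site carries no information here
        major = counts.most_common(1)[0][0]
        columns.append([0 if base == major else 1 for base in site])

    if not columns:
        return [[] for _ in alignment]
    # Transpose back so that each row corresponds to one sequence.
    return [list(row) for row in zip(*columns)]


# Toy alignment of four sequences (hypothetical data).
toy_alignment = ["ACGTA", "ACGAA", "TCGAA", "ACGAC"]
image = alignment_to_image(toy_alignment)
```

A matrix of this kind could then be passed to an image model as a single-channel input. Row ordering and padding to a fixed width are design choices that this sketch leaves open.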
Affiliation(s)
- Lex Flagel
  - Monsanto Company, Chesterfield, MO
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Yaniv Brandvain
  - Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC
206
Vivian‐Griffiths T, Baker E, Schmidt KM, Bracher‐Smith M, Walters J, Artemiou A, Holmans P, O'Donovan MC, Owen MJ, Pocklington A, Escott‐Price V. Predictive modeling of schizophrenia from genomic data: Comparison of polygenic risk score with kernel support vector machines approach. Am J Med Genet B Neuropsychiatr Genet 2019; 180:80-85. [PMID: 30516002 PMCID: PMC6492016 DOI: 10.1002/ajmg.b.32705] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/20/2018] [Revised: 09/03/2018] [Accepted: 11/09/2018] [Indexed: 11/07/2022]
Abstract
A major controversy in psychiatric genetics is whether nonadditive genetic interaction effects contribute to the risk of highly polygenic disorders. We applied a support vector machines (SVMs) approach, which is capable of building linear and nonlinear models using kernel methods, to classify cases from controls in a large schizophrenia case-control sample of 11,853 subjects (5,554 cases and 6,299 controls) and compared its prediction accuracy with the polygenic risk score (PRS) approach. We also investigated whether SVMs are a suitable approach to detecting nonlinear genetic effects, that is, interactions. We found that PRS provided more accurate case/control classification than either linear or nonlinear SVMs, and give a tentative explanation why PRS outperforms both multivariate regression and linear kernel SVMs. In addition, we observe that nonlinear kernel SVMs showed higher classification accuracy than linear SVMs when a large number of SNPs are entered into the model. We conclude that SVMs are a potential tool for assessing the presence of interactions, prior to searching for them explicitly.
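For readers unfamiliar with the polygenic risk score, a minimal sketch follows. It assumes the common additive form, in which a score is the sum of risk-allele counts (0, 1, or 2 per SNP) weighted by per-SNP effect sizes. The weights, genotypes, and threshold are invented for illustration and are not from the study, and the function names are hypothetical.

```python
def polygenic_risk_score(genotypes, weights):
    """Additive score: sum over SNPs of allele count times effect weight."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(g * w for g, w in zip(genotypes, weights))


def classify(scores, threshold):
    """Label a subject as a predicted case (1) if its score exceeds the threshold."""
    return [int(score > threshold) for score in scores]


# Hypothetical per-SNP effect weights and two subjects' allele counts.
weights = [0.5, -0.25, 1.0]
subjects = [[2, 1, 0], [0, 2, 1]]

scores = [polygenic_risk_score(g, weights) for g in subjects]
labels = classify(scores, threshold=0.6)
```

Because the score is a fixed linear combination, it cannot capture interactions between SNPs, which is the property the abstract contrasts with nonlinear kernel SVMs.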
Affiliation(s)
- Timothy Vivian‐Griffiths
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Emily Baker
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Karl M. Schmidt
  - School of Mathematics, Cardiff University, Cardiff, United Kingdom
- Matthew Bracher‐Smith
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- James Walters
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Peter Holmans
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Michael C. O'Donovan
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Michael J. Owen
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Andrew Pocklington
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Valentina Escott‐Price
  - Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
207
Villanea FA, Schraiber JG. Multiple episodes of interbreeding between Neanderthal and modern humans. Nat Ecol Evol 2019; 3:39-44. [PMID: 30478305 PMCID: PMC6309227 DOI: 10.1038/s41559-018-0735-8] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Received: 06/08/2018] [Accepted: 10/18/2018] [Indexed: 11/30/2022]
Abstract
Neanderthals and anatomically modern humans overlapped geographically for a period of over 30,000 years following human migration out of Africa. During this period, Neanderthals and humans interbred, as evidenced by Neanderthal portions of the genome carried by non-African individuals today. A key observation is that the proportion of Neanderthal ancestry is ~12-20% higher in East Asian individuals relative to European individuals. Here, we explore various demographic models that could explain this observation. These include distinguishing between a single admixture event and multiple Neanderthal contributions to either population, and the hypothesis that reduced Neanderthal ancestry in modern Europeans resulted from more recent admixture with a ghost population that lacked a Neanderthal ancestry component (the 'dilution' hypothesis). To summarize the asymmetric pattern of Neanderthal allele frequencies, we compiled the joint fragment frequency spectrum of European and East Asian Neanderthal fragments and compared it with both analytical theory and data simulated under various models of admixture. Using maximum-likelihood and machine learning, we found that a simple model of a single admixture did not fit the empirical data, and instead favour a model of multiple episodes of gene flow into both European and East Asian populations. These findings indicate a longer-term, more complex interaction between humans and Neanderthals than was previously appreciated.
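A joint frequency spectrum of the kind mentioned above can be sketched as a two-dimensional count table. The sketch assumes that, for each Neanderthal fragment, we already know how many sampled chromosomes carry it in each of two populations. The function name, variable names, and toy counts are hypothetical and do not reproduce the authors' pipeline.

```python
def joint_frequency_spectrum(counts_pop1, counts_pop2, n1, n2):
    """Build an (n1 + 1) x (n2 + 1) table of fragment counts.

    Entry [i][j] is the number of fragments carried by i of the n1
    sampled chromosomes in population 1 and by j of the n2 sampled
    chromosomes in population 2.
    """
    if len(counts_pop1) != len(counts_pop2):
        raise ValueError("both populations must list the same fragments")

    spectrum = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in zip(counts_pop1, counts_pop2):
        if not (0 <= i <= n1 and 0 <= j <= n2):
            raise ValueError("carrier count outside the sample size")
        spectrum[i][j] += 1
    return spectrum


# Toy carrier counts for four fragments, two chromosomes sampled per population.
european_counts = [1, 0, 2, 1]
east_asian_counts = [0, 1, 2, 0]
spectrum = joint_frequency_spectrum(european_counts, east_asian_counts, n1=2, n2=2)
```

Asymmetry between the two populations shows up as unequal mass on either side of the table's diagonal, and that pattern is what competing admixture models can then be compared against.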
Affiliation(s)
- Fernando A Villanea
  - Department of Biology, Temple University, Philadelphia, PA, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Joshua G Schraiber
  - Department of Biology, Temple University, Philadelphia, PA, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
208
Seijo LM, Peled N, Ajona D, Boeri M, Field JK, Sozzi G, Pio R, Zulueta JJ, Spira A, Massion PP, Mazzone PJ, Montuenga LM. Biomarkers in Lung Cancer Screening: Achievements, Promises, and Challenges. J Thorac Oncol 2018; 14:343-357. [PMID: 30529598 DOI: 10.1016/j.jtho.2018.11.023] [Citation(s) in RCA: 302] [Impact Index Per Article: 50.3] [Received: 09/13/2018] [Revised: 11/20/2018] [Accepted: 11/26/2018] [Indexed: 12/12/2022]
Abstract
The present review is an update of the research and development efforts regarding the use of molecular biomarkers in the lung cancer screening setting. The two main unmet clinical needs, namely, the refinement of risk to improve the selection of individuals undergoing screening and the characterization of undetermined nodules found during the computed tomography-based screening process are the object of the biomarkers described in the present review. We first propose some principles to optimize lung cancer biomarker discovery projects. Then, we summarize the discovery and developmental status of currently promising molecular candidates, such as autoantibodies, complement fragments, microRNAs, circulating tumor DNA, DNA methylation, blood protein profiling, or RNA airway or nasal signatures. We also mention other emerging biomarkers or new technologies to follow, such as exhaled breath biomarkers, metabolomics, sputum cell imaging, genetic predisposition studies, and the integration of next-generation sequencing into study of circulating DNA. We also underline the importance of integrating different molecular technologies together with imaging, radiomics, and artificial intelligence. We list a number of completed, ongoing, or planned trials to show the clinical utility of molecular biomarkers. Finally, we comment on future research challenges in the field of biomarkers in the context of lung cancer screening and propose a design of a trial to test the clinical utility of one or several candidate biomarkers.
Affiliation(s)
- Luis M Seijo
  - Clinica Universidad de Navarra, Madrid, Spain; CIBERES, Centro de Investigación Biomédica en Red de Enfermedades Respiratorias, Madrid, Spain
- Nir Peled
  - Oncology Division, The Legacy Heritage Oncology Center and Dr. Larry Norton Institute, Soroka Medical Center and Ben-Gurion University, Beer-Sheva, Israel
- Daniel Ajona
  - Solid Tumors Program, Centro de Investigación Médica Aplicada, Pamplona, Spain; Navarra Institute for Health Research, Pamplona, Spain; CIBERONC, Centro de Investigación Biomédica en Red de Cáncer, Madrid, Spain; Department of Biochemistry and Genetics, School of Sciences, University of Navarra, Pamplona, Spain
- Mattia Boeri
  - Department of Experimental Oncology, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- John K Field
  - The Roy Castle Lung Cancer Research Programme, Department of Molecular and Clinical Cancer Medicine, University of Liverpool, Liverpool, United Kingdom
- Gabriella Sozzi
  - Department of Experimental Oncology, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Ruben Pio
  - Solid Tumors Program, Centro de Investigación Médica Aplicada, Pamplona, Spain; Navarra Institute for Health Research, Pamplona, Spain; CIBERONC, Centro de Investigación Biomédica en Red de Cáncer, Madrid, Spain; Department of Biochemistry and Genetics, School of Sciences, University of Navarra, Pamplona, Spain
- Javier J Zulueta
  - Department of Pulmonology, Clinica Universidad de Navarra, Pamplona, Spain; Visiongate Inc., Phoenix, Arizona
- Avrum Spira
  - Boston University School of Medicine, Boston, Massachusetts
- Luis M Montuenga
  - Solid Tumors Program, Centro de Investigación Médica Aplicada, Pamplona, Spain; Navarra Institute for Health Research, Pamplona, Spain; CIBERONC, Centro de Investigación Biomédica en Red de Cáncer, Madrid, Spain; Department of Pathology, Anatomy and Physiology, School of Medicine, University of Navarra, Pamplona, Spain
209
Blischak PD, Mabry ME, Conant GC, Pires JC. Integrating Networks, Phylogenomics, and Population Genomics for the Study of Polyploidy. Annual Review of Ecology, Evolution, and Systematics 2018. [DOI: 10.1146/annurev-ecolsys-121415-032302] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 11/09/2022]
Abstract
Duplication events are regarded as sources of evolutionary novelty, but our understanding of general trends for the long-term trajectory of additional genomic material is still lacking. Organisms with a history of whole genome duplication (WGD) offer a unique opportunity to study potential trends in the context of gene retention and/or loss, gene and network dosage, and changes in gene expression. In this review, we discuss the prevalence of polyploidy across the tree of life, followed by an overview of studies investigating genome evolution and gene expression. We then provide an overview of methods in network biology, phylogenomics, and population genomics that are critical for advancing our understanding of evolution post-WGD, highlighting the need for models that can accommodate polyploids. Finally, we close with a brief note on the importance of random processes in the evolution of polyploids with respect to neutral versus selective forces, ancestral polymorphisms, and the formation of autopolyploids versus allopolyploids.
Affiliation(s)
- Paul D. Blischak
  - Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, Ohio 43210, USA
- Makenzie E. Mabry
  - Division of Biological Sciences and Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
- Gavin C. Conant
  - Division of Animal Sciences, University of Missouri, Columbia, Missouri 65211, USA
  - Current affiliation: Bioinformatics Research Center, Program in Genetics and Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27695, USA
- J. Chris Pires
  - Division of Biological Sciences and Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211-7310, USA
210
Stetter MG, Thornton K, Ross-Ibarra J. Genetic architecture and selective sweeps after polygenic adaptation to distant trait optima. PLoS Genet 2018; 14:e1007794. [PMID: 30452452 PMCID: PMC6277123 DOI: 10.1371/journal.pgen.1007794] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 06/04/2018] [Revised: 12/03/2018] [Accepted: 10/26/2018] [Indexed: 11/22/2022] Open
Abstract
Understanding the genetic basis of phenotypic adaptation to changing environments is an essential goal of population and quantitative genetics. While technological advances now allow interrogation of genome-wide genotyping data in large panels, our understanding of the process of polygenic adaptation is still limited. To address this limitation, we use extensive forward-time simulation to explore the impacts of variation in demography, trait genetics, and selection on the rate and mode of adaptation and the resulting genetic architecture. We simulate a population adapting to an optimum shift, modeling sequence variation for 20 QTL for each of 12 different demographies for 100 different traits varying in the effect size distribution of new mutations, the strength of stabilizing selection, and the contribution of the genomic background. We then use random forest regression approaches to learn the relative importance of input parameters in determining a number of aspects of the process of adaptation, including the speed of adaptation, the relative frequency of hard sweeps and sweeps from standing variation, or the final genetic architecture of the trait. We find that selective sweeps occur even for traits under relatively weak selection and where the genetic background explains most of the variation. Though most sweeps occur from variation segregating in the ancestral population, new mutations can be important for traits under strong stabilizing selection that undergo a large optimum shift. We also show that population bottlenecks and expansion impact overall genetic variation as well as the relative importance of sweeps from standing variation and the speed with which adaptation can occur. We then compare our results to two traits under selection during maize domestication, showing that our simulations qualitatively recapitulate differences between them. 
Overall, our results underscore the complex population genetics of individual loci in even relatively simple quantitative trait models, but provide a glimpse into the factors that drive this complexity and the potential of these approaches for understanding polygenic adaptation.
Affiliation(s)
- Markus G. Stetter
  - Dept. of Plant Sciences and Center for Population Biology, University of California, Davis, Davis, CA, USA
- Kevin Thornton
  - Dept. of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA, USA
- Jeffrey Ross-Ibarra
  - Dept. of Plant Sciences and Center for Population Biology, University of California, Davis, Davis, CA, USA
  - Genome Center, University of California, Davis, Davis, CA, USA
211
Kim BY, Wei X, Fitz-Gibbon S, Lohmueller KE, Ortego J, Gugger PF, Sork VL. RADseq data reveal ancient, but not pervasive, introgression between Californian tree and scrub oak species (Quercus sect. Quercus: Fagaceae). Mol Ecol 2018; 27:4556-4571. [DOI: 10.1111/mec.14869] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 07/16/2017] [Revised: 07/25/2018] [Accepted: 08/29/2018] [Indexed: 12/24/2022]
Affiliation(s)
- Bernard Y. Kim
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
- Xinzeng Wei
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
  - Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, Hubei, China
- Sorel Fitz-Gibbon
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
- Kirk E. Lohmueller
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California
- Joaquín Ortego
  - Department of Integrative Ecology, Estación Biológica de Doñana, EBD-CSIC, Seville, Spain
- Paul F. Gugger
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
  - Appalachian Laboratory, University of Maryland Center for Environmental Science, Frostburg, Maryland
- Victoria L. Sork
  - Department of Ecology and Evolutionary Biology, University of California at Los Angeles, Los Angeles, California
  - Institute of the Environment and Sustainability, University of California, Los Angeles, California
212
Wenric S, Shemirani R. Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies. Front Genet 2018; 9:297. [PMID: 30123241 PMCID: PMC6085558 DOI: 10.3389/fgene.2018.00297] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 03/08/2018] [Accepted: 07/16/2018] [Indexed: 12/17/2022] Open
Abstract
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the inter-genes relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical 2 classes sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 datasets respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
Affiliation(s)
- Stephane Wenric
  - Laboratory of Human Genetics, GIGA-Research, University of Liège, Liège, Belgium
  - Department of Genetics and Genomic Sciences, The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai Hospital, New York, NY, United States
- Ruhollah Shemirani
  - Department of Computer Science, Information Sciences Institute, University of Southern California, Marina del Rey, CA, United States
213
Kern AD, Schrider DR. diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3 (Bethesda) 2018; 8:1959-1970. [PMID: 29626082 PMCID: PMC5982824 DOI: 10.1534/g3.118.200262] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 02/26/2018] [Accepted: 04/04/2018] [Indexed: 11/18/2022]
Abstract
Identifying selective sweeps in populations that have complex demographic histories remains a difficult problem in population genetics. We previously introduced a supervised machine learning approach, S/HIC, for finding both hard and soft selective sweeps in genomes on the basis of patterns of genetic variation surrounding a window of the genome. While S/HIC was shown to be both powerful and precise, the utility of S/HIC was limited by the use of phased genomic data as input. In this report we describe a deep learning variant of our method, diploS/HIC, that uses unphased genotypes to accurately classify genomic windows. diploS/HIC is shown to be quite powerful even at moderate to small sample sizes.
Affiliation(s)
- Andrew D Kern
  - Department of Genetics, Rutgers University, Piscataway, NJ 08854
214
Schrider DR, Ayroles J, Matute DR, Kern AD. Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genet 2018; 14:e1007341. [PMID: 29684059 PMCID: PMC5933812 DOI: 10.1371/journal.pgen.1007341] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Received: 09/25/2017] [Revised: 05/03/2018] [Accepted: 03/28/2018] [Indexed: 12/30/2022] Open
Abstract
Hybridization and gene flow between species appears to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees) capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia. Understanding the extent to which species or diverged populations hybridize in nature is crucially important if we are to understand the speciation process. Accordingly numerous research groups have developed methodology for finding the genetic evidence of such introgression. In this report we develop a supervised machine learning approach for uncovering loci which have introgressed across species boundaries. 
We show that our method, FILET, has greater accuracy and power than competing methods in discovering introgression, and in addition can detect the directionality associated with the gene flow between species. Using whole genome sequences from Drosophila simulans and Drosophila sechellia we show that FILET discovers quite extensive introgression between these species that has occurred mostly from D. simulans to D. sechellia. Our work highlights the complex process of speciation even within a well-studied system and points to the growing importance of supervised machine learning in population genetics.
Affiliation(s)
- Daniel R. Schrider
  - Department of Genetics, Rutgers University, Piscataway, New Jersey, United States of America
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey, United States of America
- Julien Ayroles
  - Ecology and Evolutionary Biology Department, Princeton University, Princeton, New Jersey, United States of America
  - Lewis Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Daniel R. Matute
  - Biology Department, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Andrew D. Kern
  - Department of Genetics, Rutgers University, Piscataway, New Jersey, United States of America
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey, United States of America