1
Lim D, Baek C, Blanchette M. Graphylo: A deep learning approach for predicting regulatory DNA and RNA sites from whole-genome multiple alignments. iScience 2024; 27:109002. [PMID: 38362268 PMCID: PMC10867641 DOI: 10.1016/j.isci.2024.109002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 12/17/2023] [Accepted: 01/19/2024] [Indexed: 02/17/2024]
Abstract
This study focuses on enhancing the prediction of regulatory functional sites in DNA and RNA sequences, a crucial aspect of gene regulation. Current methods, such as motif overrepresentation and machine learning, often lack specificity. To address this issue, the study leverages evolutionary information and introduces Graphylo, a deep-learning approach for predicting transcription factor binding sites in the human genome. Graphylo combines Convolutional Neural Networks for DNA sequences with Graph Convolutional Networks on phylogenetic trees, using information from placental mammals' genomes and evolutionary history. The research demonstrates that Graphylo consistently outperforms both single-species deep learning techniques and methods that incorporate inter-species conservation scores on a wide range of datasets. It achieves this by utilizing a species-based attention model for evolutionary insights and an integrated gradient approach for nucleotide-level model interpretability. This innovative approach offers a promising avenue for improving the accuracy of regulatory site prediction in genomics.
2
Reconstruction of hundreds of reference ancestral genomes across the eukaryotic kingdom. Nat Ecol Evol 2023; 7:355-366. [PMID: 36646945 PMCID: PMC9998269 DOI: 10.1038/s41559-022-01956-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 02/17/2022] [Accepted: 11/22/2022] [Indexed: 01/18/2023]
Abstract
Ancestral sequence reconstruction is a fundamental aspect of molecular evolution studies and can trace small-scale sequence modifications through the evolution of genomes and species. In contrast, fine-grained reconstructions of ancestral genome organizations are still in their infancy, limiting our ability to draw comprehensive views of genome and karyotype evolution. Here we reconstruct the detailed gene contents and organizations of 624 ancestral vertebrate, plant, fungi, metazoan and protist genomes, 183 of which are near-complete chromosomal gene order reconstructions. Reconstructed ancestral genomes are similar to their descendants in terms of gene content as expected and agree precisely with reference cytogenetic and in silico reconstructions when available. By comparing successive ancestral genomes along the phylogenetic tree, we estimate the intra- and interchromosomal rearrangement history of all major vertebrate clades at high resolution. This freely available resource introduces the possibility to follow evolutionary processes at genomic scales in chronological order, across multiple clades and without relying on a single extant species as reference.
3
Gupta MK, Vadde R. Next-generation development and application of codon model in evolution. Front Genet 2023; 14:1091575. [PMID: 36777719 PMCID: PMC9911445 DOI: 10.3389/fgene.2023.1091575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 01/17/2023] [Indexed: 01/28/2023]
Abstract
To date, numerous nucleotide, amino acid, and codon substitution models have been developed to estimate the evolutionary history of any sequence/organism in a more comprehensive way. Of these three, the codon substitution model is the most powerful. These models have been utilized extensively for detecting selective pressure on a protein and codon usage bias, and for ancestral and phylogenetic reconstruction. However, because codon substitution models are more computationally demanding than nucleotide and amino acid substitution models, only a few studies have employed them to understand the heterogeneity of the evolutionary process in genome-scale analyses. Hence, there is always a question of how to develop more robust but less computationally demanding codon substitution models to get more accurate results. In this review article, the authors attempt to understand the basis of the development of different types of codon substitution models and how this information can be utilized to develop more robust but less computationally demanding codon substitution models. The codon substitution model makes it possible to detect the selection regime under which any gene or gene region is evolving, codon usage bias in any organism or tissue-specific region, and phylogenetic relationships between different lineages more accurately than nucleotide and amino acid substitution models. Thus, in the near future, these codon models can be utilized in the fields of conservation, breeding and medicine.
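As a rough illustration of the distinction that codon models build on, the sketch below labels codon differences between two aligned coding sequences as synonymous or nonsynonymous using the standard genetic code. It is only a raw count, not a codon substitution model (which would estimate rates such as the dN/dS ratio along a phylogeny); the function names `classify_substitution` and `count_changes` are hypothetical and do not come from the review.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_a, codon_b):
    """Return 'identical', 'synonymous', or 'nonsynonymous' for two codons."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"


def count_changes(seq_a, seq_b):
    """Count (synonymous, nonsynonymous) codon differences.

    Assumes both sequences are gap-free, in frame, and of equal length
    (a multiple of three); these are simplifications for the sketch.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be equal-length multiples of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        kind = classify_substitution(seq_a[i:i + 3], seq_b[i:i + 3])
        if kind == "synonymous":
            syn += 1
        elif kind == "nonsynonymous":
            nonsyn += 1
    return syn, nonsyn
```

For example, `count_changes("ATGCTTAAA", "ATGCTCAGA")` compares three codons: the first is unchanged, the second changes without altering leucine, and the third replaces lysine with arginine.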
4
Campitelli LF, Yellan I, Albu M, Barazandeh M, Patel ZM, Blanchette M, Hughes TR. Reconstruction of full-length LINE-1 progenitors from ancestral genomes. Genetics 2022; 221:6584822. [PMID: 35552404 PMCID: PMC9252281 DOI: 10.1093/genetics/iyac074] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/25/2022] [Accepted: 04/27/2022] [Indexed: 11/24/2022]
Abstract
Sequences derived from the Long INterspersed Element-1 (L1) family of retrotransposons occupy at least 17% of the human genome, with 67 distinct subfamilies representing successive waves of expansion and extinction in mammalian lineages. L1s contribute extensively to gene regulation, but their molecular history is difficult to trace, because most are present only as truncated and highly mutated fossils. Consequently, L1 entries in current databases of repeat sequences are composed mainly of short diagnostic subsequences, rather than full functional progenitor sequences for each subfamily. Here, we have coupled 2 levels of sequence reconstruction (at the level of whole genomes and L1 subfamilies) to reconstruct progenitor sequences for all human L1 subfamilies that are more functionally and phylogenetically plausible than existing models. Most of the reconstructed sequences are at or near the canonical length of L1s and encode uninterrupted ORFs with expected protein domains. We also show that the presence or absence of binding sites for KRAB-C2H2 Zinc Finger Proteins, even in ancient-reconstructed progenitor L1s, mirrors binding observed in human ChIP-exo experiments, thus extending the arms race and domestication model. RepeatMasker searches of the modern human genome suggest that the new models may be able to assign subfamily resolution identities to previously ambiguous L1 instances. The reconstructed L1 sequences will be useful for genome annotation and functional study of both L1 evolution and L1 contributions to host regulatory networks.
Affiliation(s)
- Laura F Campitelli
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A1, Canada
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
- Isaac Yellan
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A1, Canada
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
- Mihai Albu
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
- Marjan Barazandeh
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
  - Faculty of Pharmaceutical Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Zain M Patel
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A1, Canada
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
- Mathieu Blanchette
  - Faculty of Pharmaceutical Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Department of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada
- Timothy R Hughes
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A1, Canada
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
5
Heydeck D, Reisch F, Schäfer M, Kakularam KR, Roigas SA, Stehling S, Püschel GP, Kuhn H. The Reaction Specificity of Mammalian ALOX15 Orthologs is Changed During Late Primate Evolution and These Alterations Might Offer Evolutionary Advantages for Hominidae. Front Cell Dev Biol 2022; 10:871585. [PMID: 35531094 PMCID: PMC9068934 DOI: 10.3389/fcell.2022.871585] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/08/2022] [Accepted: 04/01/2022] [Indexed: 01/03/2023]
Abstract
Arachidonic acid lipoxygenases (ALOXs) have been implicated in the immune response of mammals. The reaction specificity of these enzymes is decisive for their biological functions and ALOX classification is based on this enzyme property. Comparing the amino acid sequences and the functional properties of selected mammalian ALOX15 orthologs, we previously hypothesized that the reaction specificity of these enzymes can be predicted based on their amino acid sequences (Triad Concept) and that mammals, which are ranked in evolution below gibbons, express arachidonic acid 12-lipoxygenating ALOX15 orthologs. In contrast, Hominidae, comprising the great apes and humans, possess 15-lipoxygenating enzymes (Evolutionary Hypothesis). These two hypotheses were based on sequence data of some 60 mammalian ALOX15 orthologs, about half of which were functionally characterized. Here, we compared the ALOX15 sequences of 152 mammals representing all major mammalian subclades, expressed 44 novel ALOX15 orthologs, and performed extensive mutagenesis studies of their triad determinants. We found that ALOX15 genes are absent in extant Prototheria but that corresponding enzymes frequently occur in Metatheria and Eutheria. More than 90% of them catalyze arachidonic acid 12-lipoxygenation and the Triad Concept is applicable to all of them. Mammals ranked in evolution above gibbons express arachidonic acid 15-lipoxygenating ALOX15 orthologs, but enzymes with similar specificity are present in less than 5% of mammals ranked below gibbons. These data suggest that ALOX15 orthologs were introduced during the Prototheria-Metatheria transition and put the Triad Concept and the Evolutionary Hypothesis on a much broader and more reliable experimental basis.
Affiliation(s)
- Dagmar Heydeck
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
  - Correspondence: Dagmar Heydeck
- Florian Reisch
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
  - Institute for Nutritional Sciences, University Potsdam, Potsdam, Germany
- Marjann Schäfer
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
  - Institute for Nutritional Sciences, University Potsdam, Potsdam, Germany
- Kumar R. Kakularam
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Sophie A. Roigas
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Sabine Stehling
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Gerhard P. Püschel
  - Institute for Nutritional Sciences, University Potsdam, Potsdam, Germany
- Hartmut Kuhn
  - Department of Biochemistry, Charité—Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
6
Lichman BR. Ancestral Sequence Reconstruction for Exploring Alkaloid Evolution. Methods Mol Biol 2022; 2505:165-179. [PMID: 35732944 DOI: 10.1007/978-1-0716-2349-7_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The complex and bioactive monoterpene indole alkaloids (MIAs) found in Catharanthus roseus and related species are the products of many millions of years of evolution through mutation and natural selection. Ancestral sequence reconstruction (ASR) is a method that combines phylogenetic analysis and experimental biochemistry to infer details about past events in protein evolution. Here, I propose that ASR could be leveraged to understand how enzymes catalyzing the formation of complex alkaloids arose over evolutionary time. I discuss the steps of ASR, including sequence selection, multiple sequence alignment, tree inference, and the generation and characterization of inferred ancestral enzymes.
Affiliation(s)
- Benjamin R Lichman
  - Centre for Novel Agricultural Products, Department of Biology, University of York, York, UK
7
Schull JK, Turakhia Y, Hemker JA, Dally WJ, Bejerano G. OUP accepted manuscript. Genome Biol Evol 2022; 14:6529394. [PMID: 35171243 PMCID: PMC8920512 DOI: 10.1093/gbe/evac013] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Accepted: 01/10/2022] [Indexed: 11/14/2022]
Abstract
We present Champagne, a whole-genome method for generating character matrices for phylogenomic analysis using large genomic indel events. By rigorously picking orthologous genes and locating large insertion and deletion events, Champagne delivers a character matrix that considerably reduces homoplasy compared with morphological and nucleotide-based matrices, on both established phylogenies and difficult-to-resolve nodes in the mammalian tree. Champagne provides ample evidence in the form of genomic structural variation to support incomplete lineage sorting and possible introgression in Paenungulata and human–chimp–gorilla, which were previously inferred primarily through matrices composed of aligned single-nucleotide characters. Champagne also offers further evidence for Myomorpha as sister to Sciuridae and Hystricomorpha in the rodent tree. Champagne has distinct theoretical advantages as an automated method that produces nearly homoplasy-free character matrices on the whole-genome scale.
Affiliation(s)
- James K Schull
  - Department of Computer Science, Stanford University, USA
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, USA
- James A Hemker
  - Department of Computer Science, Stanford University, USA
- William J Dally
  - Department of Computer Science, Stanford University, USA
  - NVIDIA, Santa Clara, California, USA
  - Department of Electrical Engineering, Stanford University, USA
- Gill Bejerano
  - Department of Computer Science, Stanford University, USA
  - Department of Developmental Biology, Stanford University, USA
  - Department of Biomedical Data Science, Stanford University, USA
  - Department of Pediatrics, Stanford University, USA
  - Corresponding author
8
Lim D, Blanchette M. EvoLSTM: context-dependent models of sequence evolution using a sequence-to-sequence LSTM. Bioinformatics 2021; 36:i353-i361. [PMID: 32657367 PMCID: PMC7355264 DOI: 10.1093/bioinformatics/btaa447] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
Motivation: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. Results: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. Availability and implementation: Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. Supplementary information: Supplementary data are available at Bioinformatics online.
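To make the idea of per-position flanking context concrete, the sketch below extracts a fixed-width window around every position of a sequence. It is not taken from the EvoLSTM code: the function name `context_windows`, the symmetric split of the 14 flanking nucleotides into 7 on each side, and the `N` padding at the sequence ends are all assumptions made for illustration.

```python
def context_windows(seq, flank=7, pad="N"):
    """Return one window per position: the base plus `flank` bases on each side.

    Positions near the ends are padded with `pad` so every window has
    length 2 * flank + 1. The default flank=7 gives 14 flanking
    nucleotides in total, which is an assumed reading of the abstract.
    """
    padded = pad * flank + seq + pad * flank
    width = 2 * flank + 1
    return [padded[i:i + width] for i in range(len(seq))]
```

With `flank=1`, `context_windows("ACGT", flank=1)` yields the windows `NAC`, `ACG`, `CGT` and `GTN`; a context-dependent model would score the central base of each window given its neighbours.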
Affiliation(s)
- Dongjoon Lim
  - School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada
- Mathieu Blanchette
  - School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada
9
Buckley RM, Kortschak RD, Adelson DL. Divergent genome evolution caused by regional variation in DNA gain and loss between human and mouse. PLoS Comput Biol 2018; 14:e1006091. [PMID: 29677183 PMCID: PMC5931693 DOI: 10.1371/journal.pcbi.1006091] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/06/2017] [Revised: 05/02/2018] [Accepted: 03/15/2018] [Indexed: 12/31/2022]
Abstract
The forces driving the accumulation and removal of non-coding DNA and ultimately the evolution of genome size in complex organisms are intimately linked to genome structure and organisation. Our analysis provides a novel method for capturing the regional variation of lineage-specific DNA gain and loss events in their respective genomic contexts. To further understand this connection we used comparative genomics to identify genome-wide individual DNA gain and loss events in the human and mouse genomes. Focusing on the distribution of DNA gains and losses, relationships to important structural features and potential impact on biological processes, we found that in autosomes, DNA gains and losses both followed separate lineage-specific accumulation patterns. However, in both species chromosome X was particularly enriched for DNA gain, consistent with its high L1 retrotransposon content required for X inactivation. We found that DNA loss was associated with gene-rich open chromatin regions and DNA gain events with gene-poor closed chromatin regions. Additionally, we found that DNA loss events tended to be smaller than DNA gain events suggesting that they were able to accumulate in gene-rich open chromatin regions due to their reduced capacity to interrupt gene regulatory architecture. GO term enrichment showed that mouse loss hotspots were strongly enriched for terms related to developmental processes. However, these genes were also located in regions with a high density of conserved elements, suggesting that despite high levels of DNA loss, gene regulatory architecture remained conserved. This is consistent with a model in which DNA gain and loss results in turnover or "churning" in regulatory element dense regions of open chromatin, where interruption of regulatory elements is selected against.
Affiliation(s)
- Reuben M. Buckley
  - Department of Genetics and Evolution, The University of Adelaide, North Tce, Adelaide, Australia
- R. Daniel Kortschak
  - Department of Genetics and Evolution, The University of Adelaide, North Tce, Adelaide, Australia
- David L. Adelson
  - Department of Genetics and Evolution, The University of Adelaide, North Tce, Adelaide, Australia
10
Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res 2017. [PMID: 28645144 PMCID: PMC5737078 DOI: 10.1093/nar/gkx554] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Indexed: 01/23/2023]
Abstract
Genome alignments provide a powerful basis to transfer gene annotations from a well-annotated reference genome to many other aligned genomes. The completeness of these annotations crucially depends on the sensitivity of the underlying genome alignment. Here, we investigated the impact of the genome alignment parameters and found that parameters with a higher sensitivity allow the detection of thousands of novel alignments between orthologous exons that have been missed before. In particular, comparisons between species separated by an evolutionary distance of >0.75 substitutions per neutral site, like human and other non-placental vertebrates, benefit from increased sensitivity. To systematically test if increased sensitivity improves comparative gene annotations, we built a multiple alignment of 144 vertebrate genomes and used this alignment to map human genes to the other 143 vertebrates with CESAR. We found that higher alignment sensitivity substantially improves the completeness of comparative gene annotations by adding on average 2382 and 7440 novel exons and 117 and 317 novel genes for mammalian and non-mammalian species, respectively. Our results suggest a more sensitive alignment strategy that should generally be used for genome alignments between distantly-related species. Our 144-vertebrate genome alignment and the comparative gene annotations (https://bds.mpi-cbg.de/hillerlab/144VertebrateAlignment_CESAR/) are a valuable resource for comparative genomics.
Affiliation(s)
- Virag Sharma
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
11
Feng B, Zhou L, Tang J. Ancestral Genome Reconstruction on Whole Genome Level. Curr Genomics 2017; 18:306-315. [PMID: 29081686 PMCID: PMC5635614 DOI: 10.2174/1389202918666170307120943] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 09/09/2016] [Revised: 10/08/2016] [Accepted: 11/03/2016] [Indexed: 11/22/2022]
Abstract
Comparative genomics, evolutionary biology, and cancer research require tools to elucidate evolutionary trajectories and reconstruct ancestral genomes. Various methods have been developed to infer the genome content and gene ordering of ancestral genomes by using genomic structural variants. There are two main kinds of computational approaches to ancestral genome reconstruction. Distance/event-based approaches employ genome evolutionary models and reconstruct the ancestral genomes that minimize the total distance or number of events over the edges of the given phylogeny. Homology/adjacency-based approaches search for conserved gene adjacencies and genome structures, and assemble these regions into ancestral genomes at the internal nodes of the given phylogeny. We review the principles and algorithms of the approaches that can reconstruct ancestral genomes at the whole-genome level. We discuss the advantages and limitations of these approaches in dealing with various genome datasets, evolutionary events, and reconstruction problems. We also discuss the improvements and developments of these approaches in subsequent research. We select the four best-known and most powerful approaches from the distance/event-based and homology/adjacency-based categories to analyze and compare their performance on different kinds of datasets and evolutionary events. Based on our experiment, GASTS has the best performance on problems with equal genome contents that involve only genome rearrangement events. PMAG++ achieves the best performance on problems with unequal genome contents that involve all possible complicated evolutionary events.
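The first step of a homology/adjacency-based approach, finding gene adjacencies conserved across genomes, can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of GASTS, PMAG++, or any other tool reviewed: it assumes unsigned genes, a single linear chromosome per genome, and equal gene content, and the function names are hypothetical.

```python
def adjacencies(gene_order):
    """Return the set of unordered neighbouring gene pairs in one chromosome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_adjacencies(genomes):
    """Return adjacencies present in every genome of a non-empty list.

    Each genome is a list of gene identifiers in chromosomal order.
    Real methods weight adjacencies by the phylogeny and resolve
    conflicts before assembling them; this sketch only intersects sets.
    """
    adjacency_sets = [adjacencies(genome) for genome in genomes]
    return set.intersection(*adjacency_sets)
```

For the three toy gene orders `a b c d`, `d c b e` and `b c a d`, the only adjacency shared by all of them is the pair `{b, c}`, which a reconstruction method would then treat as a candidate ancestral adjacency.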
Affiliation(s)
- Bing Feng
  - School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
- Lingxi Zhou
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
- Jijun Tang
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
12
Abstract
Background: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. Main text: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these methods from ancestor-descendant pairs to phylogenetic trees. Conclusions: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Affiliation(s)
- Ian H. Holmes
  - Dept of Bioengineering, University of California, Berkeley, 94720 USA
13
Hague MT, Feldman CR, Brodie ED, Brodie ED. Convergent adaptation to dangerous prey proceeds through the same first‐step mutation in the garter snake Thamnophis sirtalis. Evolution 2017; 71:1504-1518. [DOI: 10.1111/evo.13244] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 01/17/2017] [Accepted: 03/24/2017] [Indexed: 12/28/2022]
Affiliation(s)
- Michael T.J. Hague
  - Department of Biology, University of Virginia, Charlottesville, Virginia 22904
- Edmund D. Brodie
  - Department of Biology, University of Virginia, Charlottesville, Virginia 22904
14
Sundaram V, Choudhary MNK, Pehrsson E, Xing X, Fiore C, Pandey M, Maricque B, Udawatta M, Ngo D, Chen Y, Paguntalan A, Ray T, Hughes A, Cohen BA, Wang T. Functional cis-regulatory modules encoded by mouse-specific endogenous retrovirus. Nat Commun 2017; 8:14550. [PMID: 28348391 PMCID: PMC5379053 DOI: 10.1038/ncomms14550] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 07/12/2016] [Accepted: 01/11/2017] [Indexed: 01/30/2023]
Abstract
Cis-regulatory modules contain multiple transcription factor (TF)-binding sites and integrate the effects of each TF to control gene expression in specific cellular contexts. Transposable elements (TEs) are uniquely equipped to deposit their regulatory sequences across a genome, which could also contain cis-regulatory modules that coordinate the control of multiple genes with the same regulatory logic. We provide the first evidence of mouse-specific TEs that encode a module of TF-binding sites in mouse embryonic stem cells (ESCs). The majority (77%) of the individual TEs tested exhibited enhancer activity in mouse ESCs. By mutating individual TF-binding sites within the TE, we identified a module of TF-binding motifs that cooperatively enhanced gene expression. Interestingly, we also observed the same motif module in the in silico constructed ancestral TE that also acted cooperatively to enhance gene expression. Our results suggest that ancestral TE insertions might have brought in cis-regulatory modules into the mouse genome.
Collapse
Affiliation(s)
- Vasavi Sundaram
- Division of Biological and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, Missouri 63110, USA
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Mayank N. K. Choudhary
- Division of Biological and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, Missouri 63110, USA
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Erica Pehrsson
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Xiaoyun Xing
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Christopher Fiore
- Division of Biological and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, Missouri 63110, USA
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Manishi Pandey
- Division of Biological and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, Missouri 63110, USA
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Brett Maricque
- Division of Biological and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Avenue, St. Louis, Missouri 63110, USA
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Methma Udawatta
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Duc Ngo
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Yujie Chen
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Asia Paguntalan
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Tammy Ray
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Ava Hughes
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Barak A. Cohen
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
| | - Ting Wang
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, 4515 McKinley Avenue, St. Louis, Missouri 63110, USA
|
15
|
Abstract
Genome size in mammals and birds shows remarkably little interspecific variation compared with other taxa. However, genome sequencing has revealed that many mammal and bird lineages have experienced differential rates of transposable element (TE) accumulation, which would be predicted to cause substantial variation in genome size between species. Thus, we hypothesize that there has been covariation between the amount of DNA gained by transposition and lost by deletion during mammal and avian evolution, resulting in genome size equilibrium. To test this model, we develop computational methods to quantify the amount of DNA gained by TE expansion and lost by deletion over the last 100 My in the lineages of 10 species of eutherian mammals and 24 species of birds. The results reveal extensive variation in the amount of DNA gained via lineage-specific transposition, but that DNA loss counteracted this expansion to various extents across lineages. Our analysis of the rate and size spectrum of deletion events implies that DNA removal in both mammals and birds has proceeded mostly through large segmental deletions (>10 kb). These findings support a unified "accordion" model of genome size evolution in eukaryotes whereby DNA loss counteracting TE expansion is a major determinant of genome size. Furthermore, we propose that extensive DNA loss, and not necessarily a dearth of TE activity, has been the primary force maintaining the greater genomic compaction of flying birds and bats relative to their flightless relatives.
|
16
|
Jarvis ED. Perspectives from the Avian Phylogenomics Project: Questions that Can Be Answered with Sequencing All Genomes of a Vertebrate Class. Annu Rev Anim Biosci 2016; 4:45-59. [PMID: 26884102 DOI: 10.1146/annurev-animal-021815-111216] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Indexed: 11/09/2022]
Abstract
The rapid pace of advances in genome technology, with concomitant reductions in cost, makes it feasible that one day in our lifetime we will have available extant genomes of entire classes of species, including vertebrates. I recently helped cocoordinate the large-scale Avian Phylogenomics Project, which collected and sequenced genomes of 48 bird species representing most currently classified orders to address a range of questions in phylogenomics and comparative genomics. The consortium was able to answer questions not previously possible with just a few genomes. This success spurred on the creation of a project to sequence the genomes of at least one individual of all extant ∼10,500 bird species. The initiation of this project has led us to consider what questions now impossible to answer could be answered with all genomes, and could drive new questions now unimaginable. These include the generation of a highly resolved family tree of extant species, genome-wide association studies across species to identify genetic substrates of many complex traits, redefinition of species and the species concept, reconstruction of the genomes of common ancestors, and generation of new computational tools to address these questions. Here I present visions for the future by posing and answering questions regarding what scientists could potentially do with available genomes of an entire vertebrate class.
Affiliation(s)
- Erich D Jarvis
- Department of Neurobiology, Duke University Medical Center, Durham, North Carolina 27710
|
17
|
Tremblay-Savard O, Reinharz V, Waldispühl J. Reconstruction of ancestral RNA sequences under multiple structural constraints. BMC Genomics 2016; 17:862. [PMID: 28185557 PMCID: PMC5123390 DOI: 10.1186/s12864-016-3105-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Secondary structures form the scaffold of multiple sequence alignment of non-coding RNA (ncRNA) families. An accurate reconstruction of ancestral ncRNAs must use this structural signal. However, the inference of ancestors of a single ncRNA family with a single consensus structure may bias the results towards sequences with high affinity to this structure, which are far from the true ancestors. METHODS In this paper, we introduce achARNement, a maximum parsimony approach that, given two alignments of homologous ncRNA families with consensus secondary structures and a phylogenetic tree, simultaneously calculates ancestral RNA sequences for these two families. RESULTS We test our methodology on simulated data sets, and show that achARNement outperforms classical maximum parsimony approaches in terms of accuracy, but also reduces by several orders of magnitude the number of candidate sequences. To conclude this study, we apply our algorithms on the Glm clan and the FinP-traJ clan from the Rfam database. CONCLUSIONS Our results show that our methods reconstruct small sets of high-quality candidate ancestors with better agreement to the two target structures than with classical approaches. Our program is freely available at: http://csb.cs.mcgill.ca/acharnement .
Affiliation(s)
- Olivier Tremblay-Savard
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada; Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, Canada
| | - Vladimir Reinharz
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada
| | - Jérôme Waldispühl
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada
|
18
|
Perdomo-Sabogal A, Nowick K, Piccini I, Sudbrak R, Lehrach H, Yaspo ML, Warnatz HJ, Querfurth R. Human Lineage-Specific Transcriptional Regulation through GA-Binding Protein Transcription Factor Alpha (GABPa). Mol Biol Evol 2016; 33:1231-44. [PMID: 26814189 PMCID: PMC4839217 DOI: 10.1093/molbev/msw007] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023] Open
Abstract
A substantial fraction of phenotypic differences between closely related species are likely caused by differences in gene regulation. While this has already been postulated over 30 years ago, only few examples of evolutionary changes in gene regulation have been verified. Here, we identified and investigated binding sites of the transcription factor GA-binding protein alpha (GABPa) aiming to discover cis-regulatory adaptations on the human lineage. By performing chromatin immunoprecipitation-sequencing experiments in a human cell line, we found 11,619 putative GABPa binding sites. Through sequence comparisons of the human GABPa binding regions with orthologous sequences from 34 mammals, we identified substitutions that have resulted in 224 putative human-specific GABPa binding sites. To experimentally assess the transcriptional impact of those substitutions, we selected four promoters for promoter-reporter gene assays using human and African green monkey cells. We compared the activities of wild-type promoters to mutated forms, where we have introduced one or more substitutions to mimic the ancestral state devoid of the GABPa consensus binding sequence. Similarly, we introduced the human-specific substitutions into chimpanzee and macaque promoter backgrounds. Our results demonstrate that the identified substitutions are functional, both in human and nonhuman promoters. In addition, we performed GABPa knock-down experiments and found 1,215 genes as strong candidates for primary targets. Further analyses of our data sets link GABPa to cognitive disorders, diabetes, KRAB zinc finger (KRAB-ZNF), and human-specific genes. Thus, we propose that differences in GABPa binding sites played important roles in the evolution of human-specific phenotypes.
Affiliation(s)
- Alvaro Perdomo-Sabogal
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Leipzig, Germany; Paul-Flechsig Institute for Brain Research, University of Leipzig, Leipzig, Germany
| | - Katja Nowick
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Leipzig, Germany; Paul-Flechsig Institute for Brain Research, University of Leipzig, Leipzig, Germany
| | - Ilaria Piccini
- Institute of Genetics of Heart Diseases (IfGH), Department of Cardiovascular Medicine, University Hospital Münster, 48149 Münster, Germany; Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Ralf Sudbrak
- European Centre for Public Health Genomics, UNU-MERIT, University Maastricht, PO Box 616, 6200 MD Maastricht, The Netherlands; Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Hans Lehrach
- Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Marie-Laure Yaspo
- Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Hans-Jörg Warnatz
- Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Robert Querfurth
- Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
|
19
|
Berthelot C, Muffato M, Abecassis J, Roest Crollius H. The 3D organization of chromatin explains evolutionary fragile genomic regions. Cell Rep 2015; 10:1913-24. [PMID: 25801028 DOI: 10.1016/j.celrep.2015.02.046] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 07/15/2014] [Revised: 12/17/2014] [Accepted: 02/18/2015] [Indexed: 10/23/2022] Open
Abstract
Genomic rearrangements are a major source of evolutionary divergence in eukaryotic genomes, a cause of genetic diseases and a hallmark of tumor cell progression, yet the mechanisms underlying their occurrence and evolutionary fixation are poorly understood. Statistical associations between breakpoints and specific genomic features suggest that genomes may contain elusive “fragile regions” with a higher propensity for breakage. Here, we use ancestral genome reconstructions to demonstrate a near-perfect correlation between gene density and evolutionary rearrangement breakpoints. Simulations based on functional features in the human genome show that this pattern is best explained as the outcome of DNA breaks that occur in open chromatin regions coming into 3D contact in the nucleus. Our model explains how rearrangements reorganize the order of genes in an evolutionary neutral fashion and provides a basis for understanding the susceptibility of “fragile regions” to breakage.
|
20
|
Papamichos SI, Margaritis D, Kotsianidis I. Adaptive Evolution Coupled with Retrotransposon Exaptation Allowed for the Generation of a Human-Protein-Specific Coding Gene That Promotes Cancer Cell Proliferation and Metastasis in Both Haematological Malignancies and Solid Tumours: The Extraordinary Case of MYEOV Gene. SCIENTIFICA 2015; 2015:984706. [PMID: 26568894 PMCID: PMC4629056 DOI: 10.1155/2015/984706] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/15/2015] [Accepted: 09/27/2015] [Indexed: 06/05/2023]
Abstract
The incidence of cancer in humans is high compared with chimpanzees. However, previous analysis has documented that numerous human cancer-related genes are highly conserved in chimpanzee. To date, whether the human genome includes species-specific cancer-related genes that could potentially contribute to a higher cancer susceptibility remains obscure. This study focuses on MYEOV, an oncogene encoding two protein isoforms, reported as causally involved in promoting cancer cell proliferation and metastasis in both haematological malignancies and solid tumours. First, we document, via stringent in silico analysis, that MYEOV arose de novo in Catarrhini. We show that the MYEOV short-isoform start codon was evolutionarily acquired after the Catarrhini/Platyrrhini divergence. Throughout the course of Catarrhini evolution, MYEOV acquired a gradually elongated translatable open reading frame (ORF), a gradually shortened translation-regulatory upstream ORF, and alternatively spliced mRNA variants. A point mutation introduced in human allowed for the acquisition of the MYEOV long-isoform start codon. Second, we demonstrate the precious impact of exonized transposable elements on the creation of MYEOV gene structure. Third, we highlight that the initial part of the MYEOV long-isoform coding DNA sequence was under positive selection pressure during Catarrhini evolution. MYEOV represents a Primate Orphan Gene that acquired, via ORF expansion, a human-protein-specific coding potential.
Affiliation(s)
- Spyros I. Papamichos
- Department of Haematology, School of Medicine, Democritus University of Thrace, 68100 Alexandroupolis, Greece
| | - Dimitrios Margaritis
- Department of Haematology, School of Medicine, Democritus University of Thrace, 68100 Alexandroupolis, Greece
| | - Ioannis Kotsianidis
- Department of Haematology, School of Medicine, Democritus University of Thrace, 68100 Alexandroupolis, Greece
|
21
|
Duchemin W, Daubin V, Tannier E. Reconstruction of an ancestral Yersinia pestis genome and comparison with an ancient sequence. BMC Genomics 2015; 16 Suppl 10:S9. [PMID: 26450112 PMCID: PMC4603589 DOI: 10.1186/1471-2164-16-s10-s9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late middle ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements has impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging. RESULTS We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing nearly a complete genome sequence for the ancestor, containing a chromosome and three plasmids. CONCLUSION The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison.
Affiliation(s)
- Wandrille Duchemin
- Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
| | - Vincent Daubin
- Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
| | - Eric Tannier
- Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
- Institut National de Recherche en Informatique et en Automatique (INRIA) Grenoble Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot, France
|
22
|
Green RE, Braun EL, Armstrong J, Earl D, Nguyen N, Hickey G, Vandewege MW, St John JA, Capella-Gutiérrez S, Castoe TA, Kern C, Fujita MK, Opazo JC, Jurka J, Kojima KK, Caballero J, Hubley RM, Smit AF, Platt RN, Lavoie CA, Ramakodi MP, Finger JW, Suh A, Isberg SR, Miles L, Chong AY, Jaratlerdsiri W, Gongora J, Moran C, Iriarte A, McCormack J, Burgess SC, Edwards SV, Lyons E, Williams C, Breen M, Howard JT, Gresham CR, Peterson DG, Schmitz J, Pollock DD, Haussler D, Triplett EW, Zhang G, Irie N, Jarvis ED, Brochu CA, Schmidt CJ, McCarthy FM, Faircloth BC, Hoffmann FG, Glenn TC, Gabaldón T, Paten B, Ray DA. Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science 2014; 346:1254449. [PMID: 25504731 PMCID: PMC4386873 DOI: 10.1126/science.1254449] [Citation(s) in RCA: 230] [Impact Index Per Article: 23.0] [Indexed: 12/28/2022]
Abstract
To provide context for the diversification of archosaurs--the group that includes crocodilians, dinosaurs, and birds--we generated draft genomes of three crocodilians: Alligator mississippiensis (the American alligator), Crocodylus porosus (the saltwater crocodile), and Gavialis gangeticus (the Indian gharial). We observed an exceptionally slow rate of genome evolution within crocodilians at all levels, including nucleotide substitutions, indels, transposable element content and movement, gene family evolution, and chromosomal synteny. When placed within the context of related taxa including birds and turtles, this suggests that the common ancestor of all of these taxa also exhibited slow genome evolution and that the comparatively rapid evolution is derived in birds. The data also provided the opportunity to analyze heterozygosity in crocodilians, which indicates a likely reduction in population size for all three taxa through the Pleistocene. Finally, these data combined with newly published bird genomes allowed us to reconstruct the partial genome of the common ancestor of archosaurs, thereby providing a tool to investigate the genetic starting material of crocodilians, birds, and dinosaurs.
Affiliation(s)
- Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA.
| | - Edward L Braun
- Department of Biology and Genetics Institute, University of Florida, Gainesville, FL 32611, USA
| | - Joel Armstrong
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA
| | - Dent Earl
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA
| | - Ngan Nguyen
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA
| | - Glenn Hickey
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA
| | - Michael W Vandewege
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA
| | - John A St John
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA
| | - Salvador Capella-Gutiérrez
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation, 08003 Barcelona, Spain. Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Todd A Castoe
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA. Department of Biology, University of Texas, Arlington, TX 76019, USA
| | - Colin Kern
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19717, USA
| | - Matthew K Fujita
- Department of Biology, University of Texas, Arlington, TX 76019, USA
| | - Juan C Opazo
- Instituto de Ciencias Ambientales y Evolutivas, Facultad de Ciencias, Universidad Austral de Chile, Valdivia, Chile
| | - Jerzy Jurka
- Genetic Information Research Institute, Mountain View, CA 94043, USA
| | - Kenji K Kojima
- Genetic Information Research Institute, Mountain View, CA 94043, USA
| | | | | | - Arian F Smit
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Roy N Platt
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA. Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA
| | - Christine A Lavoie
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA
| | - Meganathan P Ramakodi
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA. Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA
| | - John W Finger
- Department of Environmental Health Science, University of Georgia, Athens, GA 30602, USA
| | - Alexander Suh
- Institute of Experimental Pathology (ZMBE), University of Münster, D-48149 Münster, Germany. Department of Evolutionary Biology (EBC), Uppsala University, SE-752 36 Uppsala, Sweden
| | - Sally R Isberg
- Porosus Pty. Ltd., Palmerston, NT 0831, Australia. Faculty of Veterinary Science, University of Sydney, Sydney, NSW 2006, Australia. Centre for Crocodile Research, Noonamah, NT 0837, Australia
| | - Lee Miles
- Faculty of Veterinary Science, University of Sydney, Sydney, NSW 2006, Australia
| | - Amanda Y Chong
- Faculty of Veterinary Science, University of Sydney, Sydney, NSW 2006, Australia
| | | | - Jaime Gongora
- Faculty of Veterinary Science, University of Sydney, Sydney, NSW 2006, Australia
| | - Christopher Moran
- Faculty of Veterinary Science, University of Sydney, Sydney, NSW 2006, Australia
| | - Andrés Iriarte
- Departamento de Desarrollo Biotecnológico, Instituto de Higiene, Facultad de Medicina, Universidad de la República, Montevideo, Uruguay
| | - John McCormack
- Moore Laboratory of Zoology, Occidental College, Los Angeles, CA 90041, USA
| | - Shane C Burgess
- College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ 85721, USA
| | - Scott V Edwards
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
| | - Eric Lyons
- School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
| | - Christina Williams
- Department of Molecular Biomedical Sciences, North Carolina State University, Raleigh, NC 27607, USA
| | - Matthew Breen
- Department of Molecular Biomedical Sciences, North Carolina State University, Raleigh, NC 27607, USA
| | - Jason T Howard
- Howard Hughes Medical Institute, Department of Neurobiology, Duke University Medical Center, Durham, NC 27710, USA
| | - Cathy R Gresham
- Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA
| | - Daniel G Peterson
- Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA. Department of Plant and Soil Sciences, Mississippi State University, Mississippi State, MS 39762, USA
| | - Jürgen Schmitz
- Institute of Experimental Pathology (ZMBE), University of Münster, D-48149 Münster, Germany
| | - David D Pollock
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
| | - David Haussler
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA. Howard Hughes Medical Institute, Bethesda, MD 20814, USA
| | - Eric W Triplett
- Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
| | - Guojie Zhang
- China National GeneBank, BGI-Shenzhen, Shenzhen, China. Center for Social Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Naoki Irie
- Department of Biological Sciences, Graduate School of Science, University of Tokyo, Tokyo, Japan
| | - Erich D Jarvis
- Howard Hughes Medical Institute, Department of Neurobiology, Duke University Medical Center, Durham, NC 27710, USA
| | - Christopher A Brochu
- Department of Earth and Environmental Sciences, University of Iowa, Iowa City, IA 52242, USA
| | - Carl J Schmidt
- Department of Animal and Food Sciences, University of Delaware, Newark, DE 19717, USA
| | - Fiona M McCarthy
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ 85721, USA
| | - Brant C Faircloth
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90019, USA. Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Federico G Hoffmann
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA. Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA
| | - Travis C Glenn
- Department of Environmental Health Science, University of Georgia, Athens, GA 30602, USA
| | - Toni Gabaldón
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation, 08003 Barcelona, Spain. Universitat Pompeu Fabra, 08003 Barcelona, Spain. Institució Catalana de Recerca i Estudis Avançats, 08010 Barcelona, Spain
| | - Benedict Paten
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA
| | - David A Ray
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, MS 39762, USA. Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA. Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
|
23
|
Paten B, Zerbino DR, Hickey G, Haussler D. A unifying model of genome evolution under parsimony. BMC Bioinformatics 2014; 15:206. [PMID: 24946830 PMCID: PMC4082375 DOI: 10.1186/1471-2105-15-206] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 12/12/2013] [Accepted: 05/08/2014] [Indexed: 11/23/2022] Open
Abstract
Background Parsimony and maximum likelihood methods of phylogenetic tree estimation and parsimony methods for genome rearrangements are central to the study of genome evolution yet to date they have largely been pursued in isolation. Results We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph G, a finite set of AVGs describe all parsimonious interpretations of G, and this set can be explored with a few sampling moves. Conclusion This theoretical study describes a model in which the inference of genome rearrangements and phylogeny can be unified under parsimony.
Affiliation(s)
- Benedict Paten
- University of California, Santa Cruz, 1156 High St, 95064 Santa Cruz, USA.
|
24
|
Abstract
MOTIVATIONS Recent progress in ancient DNA sequencing technologies and protocols has led to the sequencing of whole ancient bacterial genomes, as illustrated by the recent sequence of the Yersinia pestis strain that caused the Black Death pandemic. However, sequencing ancient genomes raises specific problems, because of the decay and fragmentation of ancient DNA among others, making the scaffolding of ancient contigs challenging. RESULTS We show that computational paleogenomics methods aimed at reconstructing the organization of ancestral genomes from the comparison of extant genomes can be adapted to correct, order and orient ancient bacterial contigs. We describe the method FPSAC (fast phylogenetic scaffolding of ancient contigs) and apply it to a set of 2134 ancient contigs assembled from the recently sequenced Black Death agent genome. We obtain a unique scaffold for the whole chromosome of this ancient genome that allows us to gain precise insights into the structural evolution of the Yersinia clade.
Affiliation(s)
- Ashok Rajaraman
- Department of Mathematics, Simon Fraser University, Burnaby (BC) V5A1S6, Canada, International Graduate Training Center in Mathematical Biology, Pacific Institute for the Mathematical Sciences, Vancouver (BC), Canada, INRIA Grenoble Rhône-Alpes, Montbonnot 38334, France, Université de Lyon 1, Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR5558 F-69622 Villeurbanne, France and LaBRI, Université Bordeaux I, 33405 Talence, France
|
25
|
Hiller M, Agarwal S, Notwell JH, Parikh R, Guturu H, Wenger AM, Bejerano G. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res 2013; 41:e151. [PMID: 23814184 PMCID: PMC3753653 DOI: 10.1093/nar/gkt557] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Indexed: 12/21/2022] Open
Abstract
Many important model organisms for biomedical and evolutionary research have sequenced genomes, but occupy a phylogenetically isolated position, evolutionarily distant from other sequenced genomes. This phylogenetic isolation is exemplified for zebrafish, a vertebrate model for cis-regulation, development and human disease, whose evolutionary distance to all other currently sequenced fish exceeds the distance between human and chicken. Such large distances make it difficult to align genomes and use them for comparative analysis beyond gene-focused questions. In particular, detecting conserved non-genic elements (CNEs) as promising cis-regulatory elements with biological importance is challenging. Here, we develop a general comparative genomics framework to align isolated genomes and to comprehensively detect CNEs. Our approach integrates highly sensitive and quality-controlled local alignments and uses alignment transitivity and ancestral reconstruction to bridge large evolutionary distances. We apply our framework to zebrafish and demonstrate substantially improved CNE detection and quality compared with previous sets. Our zebrafish CNE set comprises 54 533 CNEs, of which 11 792 (22%) are conserved to human or mouse. Our zebrafish CNEs (http://zebrafish.stanford.edu) are highly enriched in known enhancers and extend existing experimental (ChIP-Seq) sets. The same framework can now be applied to the isolated genomes of frog, amphioxus, Caenorhabditis elegans and many others.
Affiliation(s)
- Michael Hiller
- Department of Developmental Biology, Stanford University, Stanford, CA 94305, USA, Department of Computer Science, Stanford University, Stanford, CA 94305, USA and Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
|
26
|
A scalable and flexible approach for investigating the genomic landscapes of phylogenetic incongruence. Mol Phylogenet Evol 2013; 66:1067-74. [DOI: 10.1016/j.ympev.2012.11.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/04/2012] [Revised: 11/16/2012] [Accepted: 11/25/2012] [Indexed: 11/19/2022]
|
27
|
Blanchette M. Exploiting ancestral mammalian genomes for the prediction of human transcription factor binding sites. BMC Bioinformatics 2012; 13 Suppl 19:S2. [PMID: 23281809 PMCID: PMC3526440 DOI: 10.1186/1471-2105-13-s19-s2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Background: The computational prediction of Transcription Factor Binding Sites (TFBS) remains a challenge due to their short length and low information content. Comparative genomics approaches that simultaneously consider several related species and favor sites that have been conserved throughout evolution improve the accuracy (specificity) of the predictions but are limited due to a phenomenon called binding site turnover, where sequence evolution causes one TFBS to replace another in the same region. In parallel to this development, an increasing number of mammalian genomes are now sequenced and it is becoming possible to infer, to a surprisingly high degree of accuracy, ancestral mammalian sequences. Results: We propose a TFBS prediction approach that makes use of the availability of inferred ancestral mammalian genomes to improve its accuracy. This method aims to identify binding loci, which are regions of a few hundred base pairs that have preserved their potential to bind a given transcription factor over evolutionary time. After proposing a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length, we use it to identify regions that have preserved the number of predicted TFBS they contain to an unexpected degree given their divergence. The approach is applied to human chromosome 1 and shows significant gains in accuracy as compared to both existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates. Availability: The source code and predictions made by the program are available at http://www.cs.mcgill.ca/~blanchem/bindingLoci.
Affiliation(s)
- Mathieu Blanchette
- McGill Centre for Bioinformatics and School of Computer Science, McGill University, H3C 2B4 Québec, Canada.
|
28
|
A "forward genomics" approach links genotype to phenotype using independent phenotypic losses among related species. Cell Rep 2012; 2:817-23. [PMID: 23022484 DOI: 10.1016/j.celrep.2012.08.032] [Citation(s) in RCA: 90] [Impact Index Per Article: 7.5] [Received: 04/26/2012] [Revised: 07/31/2012] [Accepted: 08/30/2012] [Indexed: 12/27/2022] Open
Abstract
Genotype-phenotype mapping is hampered by countless genomic changes between species. We introduce a computational "forward genomics" strategy that, given only an independently lost phenotype and whole genomes, matches genomic and phenotypic loss patterns to associate specific genomic regions with this phenotype. We conducted genome-wide screens for two metabolic phenotypes. First, our approach correctly matches the inactivated Gulo gene exactly with the species that lost the ability to synthesize vitamin C. Second, we attribute naturally low biliary phospholipid levels in guinea pigs and horses to the inactivated phospholipid transporter Abcb4. Human ABCB4 mutations also result in low phospholipid levels but lead to severe liver disease, suggesting compensatory mechanisms in guinea pig and horse. Our simulation studies, counts of independent changes in existing phenotype surveys, and the forthcoming availability of many new genomes all suggest that forward genomics can be applied to many phenotypes, including those relevant for human evolution and disease.
|
29
|
Romiguier J, Ranwez V, Douzery EJP, Galtier N. Genomic evidence for large, long-lived ancestors to placental mammals. Mol Biol Evol 2012; 30:5-13. [PMID: 22949523 DOI: 10.1093/molbev/mss211] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 12/19/2022] Open
Abstract
It is widely assumed that our mammalian ancestors, which lived in the Cretaceous era, were tiny animals that survived massive asteroid impacts in shelters and evolved into modern forms after dinosaurs went extinct, 65 Ma. The small size of most Mesozoic mammalian fossils essentially supports this view. Paleontology, however, is not conclusive regarding the ancestry of extant mammals, because Cretaceous and Paleocene fossils are not easily linked to modern lineages. Here, we use full-genome data to estimate the longevity and body mass of early placental mammals. Analyzing 36 fully sequenced mammalian genomes, we reconstruct two aspects of the ancestral genome dynamics, namely GC-content evolution and nonsynonymous over synonymous rate ratio. Linking these molecular evolutionary processes to life-history traits in modern species, we estimate that early placental mammals had a life span above 25 years and a body mass above 1 kg. This is similar to current primates, cetartiodactyls, or carnivores, but markedly different from mice or shrews, challenging the dominant view about mammalian origin and evolution. Our results imply that long-lived mammals existed in the Cretaceous era and were the most successful in evolution, opening new perspectives about the conditions for survival to the Cretaceous-Tertiary crisis.
Affiliation(s)
- J Romiguier
- CNRS, Université Montpellier 2, UMR 5554, ISEM, Montpellier, France
|
30
|
Ashkenazy H, Penn O, Doron-Faigenboim A, Cohen O, Cannarozzi G, Zomer O, Pupko T. FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res 2012; 40:W580-4. [PMID: 22661579 PMCID: PMC3394241 DOI: 10.1093/nar/gks498] [Citation(s) in RCA: 229] [Impact Index Per Article: 19.1] [Indexed: 11/30/2022] Open
Abstract
Ancestral sequence reconstruction is essential to a variety of evolutionary studies. Here, we present the FastML web server, a user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiple sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences. The utility of FastML is demonstrated by reconstructing ancestral sequences of the Env protein from various HIV-1 subtypes. FastML is freely available for all academic users and is available online at http://fastml.tau.ac.il/.
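As a rough illustration of the indel-coding idea in (i), the sketch below turns each distinct gap in a toy alignment into one binary presence/absence character. It is a simplified example under stated assumptions, not FastML's code: the function names `gap_runs` and `code_indels` and the toy alignment are hypothetical, and real indel-coding schemes treat overlapping gaps more carefully than this.

```python
def gap_runs(seq):
    """Return the set of (start, end) spans of maximal '-' runs (end exclusive)."""
    runs, start = set(), None
    # A non-gap sentinel is appended so that a trailing gap run is closed.
    for i, ch in enumerate(seq + "x"):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.add((start, i))
            start = None
    return runs


def code_indels(alignment):
    """Code every distinct gap in the alignment as one binary character.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns (sorted list of gaps, dict name -> list of 0/1 values), where 1
    means the sequence carries exactly that gap (hypothetical simplification).
    """
    per_seq = {name: gap_runs(seq) for name, seq in alignment.items()}
    all_gaps = sorted(set().union(*per_seq.values()))
    matrix = {
        name: [int(gap in runs) for gap in all_gaps]
        for name, runs in per_seq.items()
    }
    return all_gaps, matrix


# Toy alignment (made up for illustration only).
toy = {"a": "AC--GT", "b": "ACTTGT", "c": "A---GT"}
gaps, matrix = code_indels(toy)
# gaps   -> [(1, 4), (2, 4)]
# matrix -> {"a": [0, 1], "b": [0, 0], "c": [1, 0]}
```

The resulting 0/1 matrix is the kind of input on which a two-state Markov model over indel presence could then be fitted along a tree; that modelling step is not shown here.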
Affiliation(s)
- Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
|
31
|
Sadri J, Diallo AB, Blanchette M. Predicting site-specific human selective pressure using evolutionary signatures. Bioinformatics 2011; 27:i266-74. [PMID: 21685080 PMCID: PMC3117352 DOI: 10.1093/bioinformatics/btr241] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. Availability: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri. Contact: blanchem@mcb.mcgill.ca
Affiliation(s)
- Javad Sadri
- School of Computer Science, McGill University, 3630 University, Montreal, QC, Canada H3A 2B2
|
32
|
Abstract
BACKGROUND: In a previous study we demonstrated that co-evolutionary information can be utilized for improving the accuracy of ancestral gene content reconstruction. To this end, we defined a new computational problem, the Ancestral Co-Evolutionary (ACE) problem, and developed algorithms for solving it. RESULTS: In the current paper we generalize our previous study in various ways. First, we describe new efficient computational approaches for solving the ACE problem. The new approaches are based on reductions to classical methods such as linear programming relaxation, quadratic programming, and min-cut. Second, we report new computational hardness results related to the ACE, including practical cases where it can be solved in polynomial time. Third, we generalize the ACE problem and demonstrate how our approach can be used for inferring parts of the genomes of non-ancestral organisms. To this end, we describe a heuristic for finding the portion of the genome ('dominant set') that can be used to reconstruct the rest of the genome with the lowest error rate. This heuristic utilizes both evolutionary information and co-evolutionary information. We implemented these algorithms on a large input of the ACE problem (95 unicellular organisms, 4,873 protein families, and 10,576 co-evolutionary relations), demonstrating that some of these algorithms can outperform the algorithm used in our previous study. In addition, we show that based on our approach a 'dominant set' can be used to reconstruct a major fraction of a genome (up to 79%) with a relatively low error rate (e.g. 0.11). We find that the 'dominant set' tends to include metabolic and regulatory genes, with high evolutionary rate, and low protein abundance and number of protein-protein interactions. CONCLUSIONS: The ACE problem can be efficiently extended for inferring the genomes of organisms that exist today. In addition, it may be solved in polynomial time in many practical cases. Metabolic and regulatory genes were found to be the most important groups of genes necessary for reconstructing gene content of an organism based on other related genomes.
Affiliation(s)
- Hadas Birin
- School of Computer Science, Tel Aviv University, Israel
- Tamir Tuller
- Department of Biomedical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv, Israel
|
33
|
Wray GA. CNCing Is Believing. Science 2011; 333:946-7. [DOI: 10.1126/science.1210771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
Analysis of conserved noncoding regions reveals macroevolutionary trends in genome function.
Affiliation(s)
- Gregory A. Wray
- Department of Biology and Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
|
34
|
Abstract
Sequence alignment (the grouping of homologous bases into one column) is fundamental to almost any task in comparative genomics. This translates to positing gaps in the genomic sequences to account for events of insertions and deletions (indels). The interrelationship between sequence alignment and phylogenetic reconstruction has drawn substantial attention recently with works showing the significance of differences in alignments. One of the plausible approaches in this direction is to grade the suitability of a tree to an associated alignment and vice versa. We here present a combinatorial (as opposed to statistical) approach based on the indel history. We show, both by simulations and by using real biological data from the Encyclopedia of DNA Elements (ENCODE), that this criterion is sound. The novelty of our approach is that it distinguishes between insertions and deletions and augments the analysis with a dimension of "depth," extending it from the sequence space to the phylogenetic space. Using this approach, we perform a comprehensive study of indel characteristic behavior among mammals in both coding and non-coding regions. Our results show significant differences in indel patterns between coding and non-coding regions. We also show other characteristic patterns of indel evolution in the depth of the underlying phylogeny.
Affiliation(s)
- Sagi Snir
- Department of Evolutionary Biology and the Institute of Evolution, Haifa University, Haifa, Israel.
|
35
|
Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Res 2011; 21:1512-28. [PMID: 21665927 DOI: 10.1101/gr.123356.111] [Citation(s) in RCA: 157] [Impact Index Per Article: 12.1] [Indexed: 11/24/2022]
Abstract
Much attention has been given to the problem of creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions. Far less attention has been paid to the problem of optimizing alignments in the presence of more general rearrangement and copy number variation. Using Cactus graphs, recently introduced for representing sequence alignments, we describe two complementary algorithms for creating genomic alignments. We have implemented these algorithms in the new "Cactus" alignment program. We test Cactus using the Evolver genome evolution simulator, a comprehensive new tool for simulation, and show using these and existing simulations that Cactus significantly outperforms all of its peers. Finally, we make an empirical assessment of Cactus's ability to properly align genes and find interesting cases of intra-gene duplication within the primates.
Affiliation(s)
- Benedict Paten
- Center for Biomolecular Science and Engineering, University of California-Santa Cruz, CA 95064, USA.
|
36
|
Mancheron A, Uricaru R, Rivals E. An alternative approach to multiple genome comparison. Nucleic Acids Res 2011; 39:e101. [PMID: 21646341 PMCID: PMC3159434 DOI: 10.1093/nar/gkr177] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/05/2022] Open
Abstract
Genome comparison is now a crucial step for genome annotation and identification of regulatory motifs. Genome comparison aims for instance at finding genomic regions either specific to or in one-to-one correspondence between individuals/strains/species. It serves e.g. to pre-annotate a new genome by automatically transferring annotations from a known one. However, efficiency, flexibility and objectives of current methods do not suit the whole spectrum of applications, genome sizes and organizations. Innovative approaches are still needed. Hence, we propose an alternative way of comparing multiple genomes based on segmentation by similarity. In this framework, rather than being formulated as a complex optimization problem, genome comparison is seen as a segmentation question for which a single optimal solution can be found in almost linear time. We apply our method to analyse three strains of a virulent pathogenic bacterium, Ehrlichia ruminantium, and identify 92 new genes. We also find out that a substantial number of genes thought to be strain specific have potential orthologs in the other strains. Our solution is implemented in an efficient program, qod, equipped with a user-friendly interface, and enables the automatic transfer of annotations between compared genomes or contigs (Video in Supplementary Data). Because it somehow disregards the relative order of genomic blocks, qod can handle unfinished genomes, which due to the difficulty of sequencing completion may become an interesting characteristic for the future. Availability: http://www.atgc-montpellier.fr/qod.
Affiliation(s)
- Alban Mancheron
- LIRMM - CNRS, Université Montpellier 2 - CC 477, 161, rue Ada, 34095 Montpellier Cedex 5, France
|
37
|
Ma J. Reconstructing the history of large-scale genomic changes: biological questions and computational challenges. J Comput Biol 2011; 18:879-93. [PMID: 21563973 DOI: 10.1089/cmb.2010.0189] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022] Open
Abstract
In addition to point mutations, larger-scale structural changes (including rearrangements, duplications, insertions, and deletions) are also prevalent between different mammalian genomes. Capturing these large-scale changes is critical to unraveling the history of mammalian evolution in order to better understand the human genome. It also has profound biomedical significance, because many human diseases are associated with structural genomic aberrations. The increasing number of mammalian genomes being sequenced as well as the recent advancement in DNA sequencing technologies are allowing us to identify these structural genomic changes with vastly greater accuracy. However, there are a considerable number of computational challenges related to these problems. In this article, we introduce the ancestral genome reconstruction problem, which enables us to explain the large-scale genomic changes between species in an evolutionary context. The application of these methods to within-species structural variation and disease genome analysis is also discussed. The target audience of this article is advanced undergraduate students in biology.
Affiliation(s)
- Jian Ma
- Department of Bioengineering, Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA.
|
38
|
Horvath JE, Sheedy CB, Merrett SL, Diallo AB, Swofford DL, NISC Comparative Sequencing Program, Green ED, Willard HF. Comparative analysis of the primate X-inactivation center region and reconstruction of the ancestral primate XIST locus. Genome Res 2011; 21:850-62. [PMID: 21518738 DOI: 10.1101/gr.111849.110] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/17/2023]
Abstract
Here we provide a detailed comparative analysis across the candidate X-Inactivation Center (XIC) region and the XIST locus in the genomes of six primates and three mammalian outgroup species. Since lemurs and other strepsirrhine primates represent the sister lineage to all other primates, this analysis focuses on lemurs to reconstruct the ancestral primate sequences and to gain insight into the evolution of this region and the genes within it. This comparative evolutionary genomics approach reveals significant expansion in genomic size across the XIC region in higher primates, with minimal size alterations across the XIST locus itself. Reconstructed primate ancestral XIC sequences show that the most dramatic changes during the past 80 million years occurred between the ancestral primate and the lineage leading to Old World monkeys. In contrast, the XIST locus compared between human and the primate ancestor does not indicate any dramatic changes to exons or XIST-specific repeats; rather, evolution of this locus reflects small incremental changes in overall sequence identity and short repeat insertions. While this comparative analysis reinforces that the region around XIST has been subject to significant genomic change, even among primates, our data suggest that evolution of the XIST sequences themselves represents only small lineage-specific changes across the past 80 million years.
Affiliation(s)
- Julie E Horvath
- Duke Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina 27708, USA.
|
39
|
Error and error mitigation in low-coverage genome assemblies. PLoS One 2011; 6:e17034. [PMID: 21340033 PMCID: PMC3038916 DOI: 10.1371/journal.pone.0017034] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 11/19/2010] [Accepted: 01/10/2011] [Indexed: 11/19/2022] Open
Abstract
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
|
40
|
Tuller T, Birin H, Kupiec M, Ruppin E. Reconstructing ancestral genomic sequences by co-evolution: formal definitions, computational issues, and biological examples. J Comput Biol 2010; 17:1327-44. [PMID: 20874411 DOI: 10.1089/cmb.2010.0112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
Abstract
The inference of ancestral genomes is a fundamental problem in molecular evolution. Due to the statistical nature of this problem, the most likely or the most parsimonious ancestral genomes usually include considerable error rates. In general, these errors cannot be abolished by utilizing more exhaustive computational approaches, by using longer genomic sequences, or by analyzing more taxa. In recent studies, we showed that co-evolution is an important force that can be used for significantly improving the inference of ancestral genome content. In this work we formally define a computational problem for the inference of ancestral genome content by co-evolution. We show that this problem is NP-hard and hard to approximate and present both a Fixed Parameter Tractable (FPT) algorithm, and heuristic approximation algorithms for solving it. The running time of these algorithms on simulated inputs with hundreds of protein families and hundreds of co-evolutionary relations was fast (up to four minutes) and it achieved an approximation ratio of <1.3. We use our approach to study the ancestral genome content of the Fungi. To this end, we implement our approach on a dataset of 33,931 protein families and 20,317 co-evolutionary relations. Our algorithm added and removed hundreds of proteins from the ancestral genomes inferred by maximum likelihood (ML) or maximum parsimony (MP) while slightly affecting the likelihood/parsimony score of the results. A biological analysis revealed various pieces of evidence that support the biological plausibility of the new solutions. In addition, we showed that our approach reconstructs missing values at the leaves of the Fungi evolutionary tree better than ML or MP.
Affiliation(s)
- Tamir Tuller
- Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel
|
41
|
Helaers R, Milinkovitch MC. MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics. BMC Bioinformatics 2010; 11:379. [PMID: 20633263 PMCID: PMC2912891 DOI: 10.1186/1471-2105-11-379] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Received: 02/23/2010] [Accepted: 07/15/2010] [Indexed: 11/11/2022] Open
Abstract
Background: The development, in the last decade, of stochastic heuristics implemented in robust application software has made large phylogeny inference a key step in most comparative studies involving molecular sequences. Still, the choice of phylogeny inference software is often dictated by a combination of parameters not related to the raw performance of the implemented algorithm(s) but rather by practical issues such as ergonomics and/or the availability of specific functionalities. Results: Here, we present MetaPIGA v2.0, a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including a Simulated Annealing algorithm, a classical Genetic Algorithm, and the Metapopulation Genetic Algorithm (metaGA) together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA v2.0 also implements the Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for automated selection of substitution models that best fit the data. Heuristics and substitution models are highly customizable through manual batch files and command line processing. However, MetaPIGA v2.0 also offers an extensive graphical user interface for parameter setting, generating and running batch files, following run progress, and manipulating result trees. MetaPIGA v2.0 uses standard formats for data sets and trees, is platform independent, runs on 32- and 64-bit systems, and takes advantage of multiprocessor and multicore computers. Conclusions: The metaGA resolves the major problem inherent to classical Genetic Algorithms by maintaining high inter-population variation even under strong intra-population selection. Implementation of the metaGA together with additional stochastic heuristics into a single software package will allow rigorous optimization of each heuristic as well as a meaningful comparison of performances among these algorithms. MetaPIGA v2.0 gives access both to high customization for the phylogeneticist, as well as to an ergonomic interface and functionalities assisting the non-specialist for sound inference of large phylogenetic trees using nucleotide sequences. MetaPIGA v2.0 and its extensive user-manual are freely available to academics at http://www.metapiga.org.
|
42
|
Hanson-Smith V, Kolaczkowski B, Thornton JW. Robustness of ancestral sequence reconstruction to phylogenetic uncertainty. Mol Biol Evol 2010; 27:1988-99. [PMID: 20368266 PMCID: PMC2922618 DOI: 10.1093/molbev/msq081] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.1] [Indexed: 11/12/2022] Open
Abstract
Ancestral sequence reconstruction (ASR) is widely used to formulate and test hypotheses about the sequences, functions, and structures of ancient genes. Ancestral sequences are usually inferred from an alignment of extant sequences using a maximum likelihood (ML) phylogenetic algorithm, which calculates the most likely ancestral sequence assuming a probabilistic model of sequence evolution and a specific phylogeny—typically the tree with the ML. The true phylogeny is seldom known with certainty, however. ML methods ignore this uncertainty, whereas Bayesian methods incorporate it by integrating the likelihood of each ancestral state over a distribution of possible trees. It is not known whether Bayesian approaches to phylogenetic uncertainty improve the accuracy of inferred ancestral sequences. Here, we use simulation-based experiments under both simplified and empirically derived conditions to compare the accuracy of ASR carried out using ML and Bayesian approaches. We show that incorporating phylogenetic uncertainty by integrating over topologies very rarely changes the inferred ancestral state and does not improve the accuracy of the reconstructed ancestral sequence. Ancestral state reconstructions are robust to uncertainty about the underlying tree because the conditions that produce phylogenetic uncertainty also make the ancestral state identical across plausible trees; conversely, the conditions under which different phylogenies yield different inferred ancestral states produce little or no ambiguity about the true phylogeny. Our results suggest that ML can produce accurate ASRs, even in the face of phylogenetic uncertainty. Using Bayesian integration to incorporate this uncertainty is neither necessary nor beneficial.
|
43
|
Li G, Ma J, Zhang L. Greedy selection of species for ancestral state reconstruction on phylogenies: elimination is better than insertion. PLoS One 2010; 5:e8985. [PMID: 20140213 PMCID: PMC2816206 DOI: 10.1371/journal.pone.0008985] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/18/2009] [Accepted: 01/05/2010] [Indexed: 12/26/2022] Open
Abstract
Accurate reconstruction of ancestral character states on a phylogeny is crucial in many genomics studies. We study how to select species to achieve the best reconstruction of ancestral character states on a phylogeny. We first show that the marginal maximum likelihood has the monotonicity property that more taxa give better reconstruction, but the Fitch method does not have it even on an ultrametric phylogeny. We further validate a greedy approach for species selection using simulation. The validation tests indicate that backward greedy selection outperforms forward greedy selection. In addition, by applying our selection strategy, we obtain a set of the ten most informative species for the reconstruction of the genomic sequence of the so-called boreoeutherian ancestor of placental mammals. This study has broad relevance in comparative genomics and paleogenomics since limited research resources do not allow researchers to sequence the large number of descendant species required to reconstruct an ancestral sequence.
Affiliation(s)
- Guoliang Li: Computational & Mathematical Biology, Genome Institute of Singapore, Singapore, Singapore
- Jian Ma: Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Louxin Zhang: Department of Mathematics, National University of Singapore, Singapore, Singapore
|
44
|
Kim J, Sinha S. Towards realistic benchmarks for multiple alignments of non-coding sequences. BMC Bioinformatics 2010; 11:54. [PMID: 20102627 PMCID: PMC2823711 DOI: 10.1186/1471-2105-11-54] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 08/07/2009] [Accepted: 01/26/2010] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.
RESULTS: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.
CONCLUSION: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
Affiliation(s)
- Jaebum Kim: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
45
|
Tuller T, Birin H, Gophna U, Kupiec M, Ruppin E. Reconstructing ancestral gene content by coevolution. Genome Res 2009; 20:122-32. [PMID: 19948819 DOI: 10.1101/gr.096115.109] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 11/24/2022]
Abstract
Inferring the gene content of ancestral genomes is a fundamental challenge in molecular evolution. Due to the statistical nature of this problem, ancestral genomes inferred by the maximum likelihood (ML) or the maximum-parsimony (MP) methods are prone to considerable error rates. In general, these errors are difficult to abolish by using longer genomic sequences or by analyzing more taxa. This study describes a new approach for improving ancestral genome reconstruction, the ancestral coevolver (ACE), which utilizes coevolutionary information to improve the accuracy of such reconstructions over previous approaches. The principal idea is to reduce the potentially large solution space by choosing a single optimal (or near optimal) solution that is in accord with the coevolutionary relationships between protein families. Simulation experiments, both on artificial and real biological data, show that ACE yields a marked decrease in error rate compared with ML or MP. Applied to a large data set (95 organisms, 4873 protein families, and 10,000 coevolutionary relationships), some of the ancestral genomes reconstructed by ACE were remarkably different in their gene content from those reconstructed by ML or MP alone (more than 10% in some nodes). These reconstructions, while having almost similar likelihood/parsimony scores as those obtained with ML/MP, had markedly higher concordance with the coevolutionary information. Specifically, when ACE was implemented to improve the results of ML, it added a large number of proteins to those encoded by LUCA (last universal common ancestor), most of them ribosomal proteins and components of the F(0)F(1)-type ATP synthase/ATPases, complexes that are vital in most living organisms. Our analysis suggests that LUCA appears to have been bacterial-like and had a genome size similar to the genome sizes of many extant organisms.
Affiliation(s)
- Tamir Tuller: School of Computer Sciences, Tel Aviv University, Ramat Aviv, Israel.
|
46
|
Diallo AB, Makarenkov V, Blanchette M. Ancestors 1.0: a web server for ancestral sequence reconstruction. Bioinformatics 2009; 26:130-1. [PMID: 19850756 DOI: 10.1093/bioinformatics/btp600] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: The computational inference of ancestral genomes consists of five difficult steps: identifying syntenic regions, inferring ancestral arrangement of syntenic regions, aligning multiple sequences, reconstructing the insertion and deletion history, and finally inferring substitutions. Each of these steps has received a lot of attention in the past years. However, there currently exists no framework that integrates all of the different steps in an easy workflow. Here, we introduce Ancestors 1.0, a web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. It implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction.
AVAILABILITY: Ancestors 1.0 is available at http://ancestors.bioinfo.uqam.ca/ancestorWeb/.
Affiliation(s)
- Abdoulaye Banire Diallo: Department of Computer Science, Université du Québec à Montréal, PO. Box 8888 Downtown Station, Montreal, QC H3C3P8, Canada.
|
47
|
Bradley RK, Holmes I. Evolutionary triplet models of structured RNA. PLoS Comput Biol 2009; 5:e1000483. [PMID: 19714212 PMCID: PMC2725318 DOI: 10.1371/journal.pcbi.1000483] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/21/2008] [Accepted: 07/23/2009] [Indexed: 12/31/2022] Open
Abstract
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a "transducer composition" algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). 
For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
Affiliation(s)
- Robert K. Bradley: Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Ian Holmes: Biophysics Graduate Group, University of California, Berkeley, California, United States of America; Department of Bioengineering, University of California, Berkeley, California, United States of America
|
48
|
Wilson MA, Makova KD. Evolution and survival on eutherian sex chromosomes. PLoS Genet 2009; 5:e1000568. [PMID: 19609352 PMCID: PMC2704370 DOI: 10.1371/journal.pgen.1000568] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Received: 12/23/2008] [Accepted: 06/18/2009] [Indexed: 11/19/2022] Open
Abstract
Since the two eutherian sex chromosomes diverged from an ancestral autosomal pair, the X has remained relatively gene-rich, while the Y has lost most of its genes through the accumulation of deleterious mutations in nonrecombining regions. Presently, it is unclear what is distinctive about genes that remain on the Y chromosome, when the sex chromosomes acquired their unique evolutionary rates, and whether X-Y gene divergence paralleled that of paralogs located on autosomes. To tackle these questions, here we juxtaposed the evolution of X and Y homologous genes (gametologs) in eutherian mammals with their autosomal orthologs in marsupial and monotreme mammals. We discovered that genes on the X and Y acquired distinct evolutionary rates immediately following the suppression of recombination between the two sex chromosomes. The Y-linked genes evolved at higher rates, while the X-linked genes maintained the lower evolutionary rates of the ancestral autosomal genes. These distinct rates have been maintained throughout the evolution of X and Y. Specifically, in humans, most X gametologs and, curiously, also most Y gametologs evolved under stronger purifying selection than similarly aged autosomal paralogs. Finally, after evaluating the current experimental data from the literature, we concluded that unique mRNA/protein expression patterns and functions acquired by Y (versus X) gametologs likely contributed to their retention. Our results also suggest that either the boundary between sex chromosome strata 3 and 4 should be shifted or that stratum 3 should be divided into two strata. Using recently available marsupial and monotreme genomes, we investigated nascent sex chromosome evolution in mammals. 
We show that, in eutherian mammals, X and Y genes acquired distinct evolutionary rates and functional constraints immediately after recombination suppression; X-linked genes maintained lower, ancestral (autosomal), rates, whereas the evolutionary rates of Y-linked genes increased. Most X and, unexpectedly, Y genes evolved under stronger purifying selection than similarly aged autosomal paralogs. However, we also observed that the divergence of gametologs and paralogs shared similar features. In addition, many Y-linked copies evolved unique functions and expression patterns compared to their counterparts on the X chromosome. Therefore, our results suggest that to be retained on the Y chromosome, genes need to acquire separately valuable expression and/or functions to be safeguarded by purifying selection.
Affiliation(s)
- Melissa A. Wilson: Department of Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America; Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America; The Integrative Biosciences Program, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Kateryna D. Makova: Department of Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America; Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America; The Integrative Biosciences Program, Pennsylvania State University, University Park, Pennsylvania, United States of America
|
49
|
Abstract
Many methods exist for reconstructing phylogenies from molecular sequence data, but few phylogenies are known and can be used to check their efficacy. Simulation remains the most important approach to testing the accuracy and robustness of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions. We implement a portable and flexible application, named INDELible, for generating nucleotide, amino acid and codon sequence data by simulating insertions and deletions (indels) as well as substitutions. Indels are simulated under several models of indel-length distribution. The program implements a rich repertoire of substitution models, including the general unrestricted model and nonstationary nonhomogeneous models of nucleotide substitution, mixture, and partition models that account for heterogeneity among sites, and codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. With its many unique features, INDELible should be useful for evaluating the performance of many inference methods, including those for multiple sequence alignment, phylogenetic tree inference, and ancestral sequence, or genome reconstruction.
Affiliation(s)
- William Fletcher: Department of Genetics, Evolution and Environment and Centre for Mathematics and Physics in the Life Sciences and Experimental Biology, University College London, London, UK
|
50
|
Liberles DA. Reading the Story in DNA: A Beginner's Guide to Molecular Evolution. Syst Biol 2009. [DOI: 10.1093/sysbio/syp003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
|