1
|
Mayer C, Vogt A, Uslu T, Scalzitti N, Chennen K, Poch O, Thompson JD. CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach. J Fungi (Basel) 2023; 9:jof9040424. [PMID: 37108879 PMCID: PMC10141177 DOI: 10.3390/jof9040424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 03/21/2023] [Accepted: 03/28/2023] [Indexed: 03/31/2023] Open
Abstract
In fungi, the most abundant transcription factor (TF) class contains a fungal-specific ‘GAL4-like’ Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as ‘fungal_trans’ or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these ‘MHD-only’ proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6–MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.
Affiliation(s)
- Claudine Mayer
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
  - Faculté des Sciences, Université Paris Cité, UFR Sciences du Vivant, 75013 Paris, France
  - Correspondence: (C.M.); (J.D.T.)
- Arthur Vogt
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Tuba Uslu
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Nicolas Scalzitti
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Kirsley Chennen
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Olivier Poch
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Julie D. Thompson
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
  - Correspondence: (C.M.); (J.D.T.)
|
2
|
Bricout R, Weil D, Stroebel D, Genovesio A, Roest Crollius H. Evolution is not Uniform Along Coding Sequences. Mol Biol Evol 2023; 40:7060063. [PMID: 36857092 PMCID: PMC10025431 DOI: 10.1093/molbev/msad042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 02/15/2023] [Accepted: 02/16/2023] [Indexed: 03/02/2023] Open
Abstract
Amino acids evolve at different speeds within protein sequences, because their functional and structural roles are different. Notably, amino acids located at the surface of proteins are known to evolve more rapidly than those in the core. In particular, amino acids at the N- and C-termini of protein sequences are likely to be more exposed than those at the core of the folded protein due to their location in the peptidic chain, and they are known to be less structured. Because of these reasons, we would expect that amino acids located at protein termini would evolve faster than residues located inside the chain. Here we test this hypothesis and found that amino acids evolve almost twice as fast at protein termini compared with those in the center, hinting at a strong topological bias along the sequence length. We further show that the distribution of solvent-accessible residues and functional domains in proteins readily explain how structural and functional constraints are weaker at their termini, leading to the observed excess of amino acid substitutions. Finally, we show that the specific evolutionary rates at protein termini may have direct consequences, notably misleading in silico methods used to infer sites under positive selection within genes. These results suggest that accounting for positional information should improve evolutionary models.
Affiliation(s)
- Raphaël Bricout
  - Département de biologie, École normale supérieure, Institut de Biologie de l'ENS (IBENS), CNRS, INSERM, Paris, France
- Dominique Weil
  - Laboratoire de Biologie du Développement, Sorbonne Université, CNRS, Institut de Biologie Paris-Seine (IBPS), Paris, France
- David Stroebel
  - Département de biologie, École normale supérieure, Institut de Biologie de l'ENS (IBENS), CNRS, INSERM, Paris, France
- Auguste Genovesio
  - Département de biologie, École normale supérieure, Institut de Biologie de l'ENS (IBENS), CNRS, INSERM, Paris, France
- Hugues Roest Crollius
  - Département de biologie, École normale supérieure, Institut de Biologie de l'ENS (IBENS), CNRS, INSERM, Paris, France
|
3
|
Khodji H, Collet P, Thompson JD, Jeannin-Girardon A. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. Appl Intell 2023. [DOI: 10.1007/s10489-022-04390-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
|
4
|
Premzl M. Revised eutherian gene collections. BMC Genom Data 2022; 23:56. [PMID: 35870891 PMCID: PMC9308196 DOI: 10.1186/s12863-022-01071-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2021] [Accepted: 07/13/2022] [Indexed: 11/24/2022] Open
Abstract
Objectives: The most recent research projects in scientific field of eutherian comparative genomics included intentions to sequence every extant eutherian species genome in foreseeable future, so that future revisions and updates of eutherian gene data sets were expected. Data description: Using 35 public eutherian reference genomic sequence assemblies and free available software, the eutherian comparative genomic analysis protocol RRID:SCR_014401 was published as guidance against potential genomic sequence errors. The protocol curated 14 eutherian third-party data gene data sets, including, in aggregate, 2615 complete coding sequences that were deposited in European Nucleotide Archive. The published eutherian gene collections were used in revisions and updates of eutherian gene data set classifications and nomenclatures that included gene annotations, phylogenetic analyses and protein molecular evolution analyses.
|
5
|
Box ICH, Matthews BJ, Marshall KE. Molecular evidence of intertidal habitats selecting for repeated ice-binding protein evolution in invertebrates. J Exp Biol 2022; 225:274373. [PMID: 35258616 DOI: 10.1242/jeb.243409] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/30/2021] [Accepted: 12/20/2021] [Indexed: 12/21/2022]
Abstract
Ice-binding proteins (IBPs) have evolved independently in multiple taxonomic groups to improve their survival at sub-zero temperatures. Intertidal invertebrates in temperate and polar regions frequently encounter sub-zero temperatures, yet there is little information on IBPs in these organisms. We hypothesized that there are far more IBPs than are currently known and that the occurrence of freezing in the intertidal zone selects for these proteins. We compiled a list of genome-sequenced invertebrates across multiple habitats and a list of known IBP sequences and used BLAST to identify a wide array of putative IBPs in those invertebrates. We found that the probability of an invertebrate species having an IBP was significantly greater in intertidal species than in those primarily found in open ocean or freshwater habitats. These intertidal IBPs had high sequence similarity to fish and tick antifreeze glycoproteins and fish type II antifreeze proteins. Previously established classifiers based on machine learning techniques further predicted ice-binding activity in the majority of our newly identified putative IBPs. We investigated the potential evolutionary origin of one putative IBP from the hard-shelled mussel Mytilus coruscus and suggest that it arose through gene duplication and neofunctionalization. We show that IBPs likely readily evolve in response to freezing risk and that there is an array of uncharacterized IBPs, and highlight the need for broader laboratory-based surveys of the diversity of ice-binding activity across diverse taxonomic and ecological groups.
Affiliation(s)
- Isaiah C H Box
  - Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
- Benjamin J Matthews
  - Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
- Katie E Marshall
  - Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
|
6
|
Premzl M. Comparative genomic analysis of eutherian fibroblast growth factor genes. BMC Genomics 2020; 21:542. [PMID: 32758140 PMCID: PMC7430813 DOI: 10.1186/s12864-020-06958-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/29/2019] [Accepted: 07/29/2020] [Indexed: 12/14/2022] Open
Abstract
Background: The eutherian fibroblast growth factors were implicated as key regulators in developmental processes. However, there were major disagreements in descriptions of comprehensive eutherian fibroblast growth factors gene data sets including either 18 or 22 homologues. The present analysis attempted to revise and update comprehensive eutherian fibroblast growth factor gene data sets, and address and resolve major discrepancies in their descriptions using eutherian comparative genomic analysis protocol and 35 public eutherian reference genomic sequence data sets. Results: Among 577 potential coding sequences, the tests of reliability of eutherian public genomic sequences annotated most comprehensive curated eutherian third-party data gene data set of fibroblast growth factor genes including 267 complete coding sequences. The present study first described 8 superclusters including 22 eutherian fibroblast growth factor major gene clusters, proposing their updated classification and nomenclature. Conclusions: The integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis argued that comprehensive eutherian fibroblast growth factor gene data set classifications included 22 rather than 18 homologues.
|
7
|
Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 2020; 21:513. [PMID: 33172385 PMCID: PMC7656754 DOI: 10.1186/s12859-020-03855-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 07/28/2020] [Accepted: 10/30/2020] [Indexed: 11/10/2022] Open
Abstract
Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
Affiliation(s)
- Corentin Meyer
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Nicolas Scalzitti
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Anne Jeannin-Girardon
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Pierre Collet
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Olivier Poch
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Julie D Thompson
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
|
8
|
Premzl M. Comparative genomic analysis of eutherian interferon genes. Genomics 2020; 112:4749-4759. [DOI: 10.1016/j.ygeno.2020.08.029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/14/2020] [Revised: 08/18/2020] [Accepted: 08/25/2020] [Indexed: 01/23/2023]
|
9
|
Prosdocimi F, Zamudio GS, Palacios-Pérez M, Torres de Farias S, V. José M. The Ancient History of Peptidyl Transferase Center Formation as Told by Conservation and Information Analyses. Life (Basel) 2020; 10:life10080134. [PMID: 32764248 PMCID: PMC7459865 DOI: 10.3390/life10080134] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/07/2020] [Revised: 07/24/2020] [Accepted: 07/31/2020] [Indexed: 12/19/2022] Open
Abstract
The peptidyl transferase center (PTC) is the catalytic center of the ribosome and forms part of the 23S ribosomal RNA. The PTC has been recognized as the earliest ribosomal part and its origins embodied the First Universal Common Ancestor (FUCA). The PTC is frequently assumed to be highly conserved along all living beings. In this work, we posed the following questions: (i) How many 100% conserved bases can be found in the PTC? (ii) Is it possible to identify clusters of informationally linked nucleotides along its sequence? (iii) Can we propose how the PTC was formed? (iv) How does sequence conservation reflect on the secondary and tertiary structures of the PTC? Aiming to answer these questions, all available complete sequences of 23S ribosomal RNA from Bacteria and Archaea deposited on GenBank database were downloaded. Using a sequence bait of 179 bp from the PTC of Thermus thermophilus, we performed an optimum pairwise alignment to retrieve the PTC region from 1424 filtered 23S rRNA sequences. These PTC sequences were multiply aligned, and the conserved regions were assigned and observed along the primary, secondary, and tertiary structures. The PTC structure was observed to be more highly conserved close to the adenine located at the catalytical site. Clusters of interrelated, co-evolving nucleotides reinforce previous assumptions that the PTC was formed by the concatenation of proto-tRNAs and important residues responsible for its assembly were identified. The observed sequence variation does not seem to significantly affect the 3D structure of the PTC ribozyme.
Affiliation(s)
- Francisco Prosdocimi
  - Laboratório de Biologia Teórica e de Sistemas, Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro 21.941-902, Brazil
  - Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Ciudad Universitaria, CDMX 04510, Mexico
  - Correspondence: (F.P.); (M.V.J.)
- Gabriel S. Zamudio
  - Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Ciudad Universitaria, CDMX 04510, Mexico
- Miryam Palacios-Pérez
  - Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Ciudad Universitaria, CDMX 04510, Mexico
- Sávio Torres de Farias
  - Laboratório de Genética Evolutiva Paulo Leminsk, Departamento de Biologia Molecular, Universidade Federal da Paraíba, João Pessoa, Paraíba 58051-900, Brazil
- Marco V. José
  - Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Ciudad Universitaria, CDMX 04510, Mexico
  - Correspondence: (F.P.); (M.V.J.)
|
10
|
Ocampo Daza D, Haitina T. Reconstruction of the Carbohydrate 6-O Sulfotransferase Gene Family Evolution in Vertebrates Reveals Novel Member, CHST16, Lost in Amniotes. Genome Biol Evol 2020; 12:993-1012. [PMID: 32652010 PMCID: PMC7353957 DOI: 10.1093/gbe/evz274] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 12/10/2019] [Indexed: 12/24/2022] Open
Abstract
Glycosaminoglycans are sulfated polysaccharide molecules, essential for many biological processes. The 6-O sulfation of glycosaminoglycans is carried out by carbohydrate 6-O sulfotransferases (C6OSTs), previously named Gal/GalNAc/GlcNAc 6-O sulfotransferases. Here, for the first time, we present a detailed phylogenetic reconstruction, analysis of gene synteny conservation and propose an evolutionary scenario for the C6OST family in major vertebrate groups, including mammals, birds, nonavian reptiles, amphibians, lobe-finned fishes, ray-finned fishes, cartilaginous fishes, and jawless vertebrates. The C6OST gene expansion likely started early in the chordate lineage, giving rise to four ancestral genes after the divergence of tunicates and before the emergence of extant vertebrates. The two rounds of whole-genome duplication in early vertebrate evolution (1R/2R) only contributed two additional C6OST subtype genes, increasing the vertebrate repertoire from four genes to six, divided into two branches. The first branch includes CHST1 and CHST3 as well as a previously unrecognized subtype, CHST16 that was lost in amniotes. The second branch includes CHST2, CHST7, and CHST5. Subsequently, local duplications of CHST5 gave rise to CHST4 in the ancestor of tetrapods, and to CHST6 in the ancestor of primates. The teleost-specific gene duplicates were identified for CHST1, CHST2, and CHST3 and are result of whole-genome duplication (3R) in the teleost lineage. We could also detect multiple, more recent lineage-specific duplicates. Thus, the vertebrate repertoire of C6OST genes has been shaped by gene duplications and gene losses at several stages of vertebrate evolution, with implications for the evolution of skeleton, nervous system, and cell-cell interactions.
Affiliation(s)
- Daniel Ocampo Daza
  - Department of Organismal Biology, Uppsala University, Sweden
  - School of Natural Sciences, University of California Merced
- Tatjana Haitina
  - Department of Organismal Biology, Uppsala University, Sweden
|
11
|
Mittal P, Jaiswal SK, Vijay N, Saxena R, Sharma VK. Comparative analysis of corrected tiger genome provides clues to its neuronal evolution. Sci Rep 2019; 9:18459. [PMID: 31804567 PMCID: PMC6895189 DOI: 10.1038/s41598-019-54838-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/14/2019] [Accepted: 11/14/2019] [Indexed: 01/01/2023] Open
Abstract
The availability of completed and draft genome assemblies of tiger, leopard, and other felids provides an opportunity to gain comparative insights on their unique evolutionary adaptations. However, genome-wide comparative analyses are susceptible to errors in genome sequences and thus require accurate genome assemblies for reliable evolutionary insights. In this study, while analyzing the tiger genome, we found almost one million erroneous substitutions in the coding and non-coding region of the genome affecting 4,472 genes, hence, biasing the current understanding of tiger evolution. Moreover, these errors produced several misleading observations in previous studies. Thus, to gain insights into the tiger evolution, we corrected the erroneous bases in the genome assembly and gene set of tiger using ‘SeqBug’ approach developed in this study. We sequenced the first Bengal tiger genome and transcriptome from India to validate these corrections. A comprehensive evolutionary analysis was performed using 10,920 orthologs from nine mammalian species including the corrected gene sets of tiger and leopard and using five different methods at three hierarchical levels, i.e. felids, Panthera, and tiger. The unique genetic changes in tiger revealed that the genes showing signatures of adaptation in tiger were enriched in development and neuronal functioning. Specifically, the genes belonging to the Notch signalling pathway, which is among the most conserved pathways involved in embryonic and neuronal development, were found to have significantly diverged in tiger in comparison to the other mammals. Our findings suggest the role of adaptive evolution in neuronal functions and development processes, which correlates well with the presence of exceptional traits such as sensory perception, strong neuro-muscular coordination, and hypercarnivorous behaviour in tiger.
Affiliation(s)
- Parul Mittal
  - Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Shubham K Jaiswal
  - Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Nagarjun Vijay
  - Computational Evolutionary Genomics Lab, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Rituja Saxena
  - Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Vineet K Sharma
  - Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
|
12
|
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. BMC Genomics 2019; 20:753. [PMID: 31623555 PMCID: PMC6798390 DOI: 10.1186/s12864-019-6064-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/18/2018] [Accepted: 08/27/2019] [Indexed: 02/06/2023] Open
Abstract
Background: The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results: Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions: In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.
Affiliation(s)
- Jeanne Wilbrandt
  - Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
  - Present address: Hoffmann Research Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Beutenbergstraße 11, 07745, Jena, Germany
- Bernhard Misof
  - Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
- Kristen A Panfilio
  - School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
- Oliver Niehuis
  - Evolutionary Biology and Ecology, Institute of Biology I (Zoology), Albert Ludwig University, Hauptstr. 1, 79104, Freiburg, Germany
|
13
|
|
14
|
Abstract
Genome assemblies from next-generation sequencing technologies are now an integral part of biological research, but many sequencing and assembly processes are still error-prone. Unfortunately, these errors can propagate to downstream analyses and wreak havoc on results and conclusions. Although such errors are recognized when dealing with diploid genotype data, modern reference assemblies (which are represented as haploid sequences) lack any type of succinct quality assessment for every position. Here we present Referee, a program that uses diploid genotype quality information in order to annotate a haploid assembly with a quality score for every position. Referee aims to provide an assembly with concise quality information on a Phred-like scale in FASTQ format for easy filtering of low-quality sites. Referee also provides output of quality scores in BED format that can be easily visualized as tracks on most genome browsers. Referee is freely available at https://gwct.github.io/referee/.
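The "Phred-like scale in FASTQ format" mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard Phred definition, Q = -10 · log10(P_error), and the common Sanger FASTQ encoding (ASCII offset 33). It is not taken from Referee's source code: the function names, the cap of 93, and the filtering threshold of 20 are hypothetical choices made for illustration only.

```python
import math


def phred_from_error_prob(p_error: float) -> int:
    """Standard Phred score: Q = -10 * log10(P_error), rounded to an integer."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return round(-10.0 * math.log10(p_error))


def fastq_char(q: int, offset: int = 33, q_max: int = 93) -> str:
    """Encode a Phred score as one FASTQ quality character (Sanger offset 33).

    The cap q_max=93 keeps the character within printable ASCII ('~');
    it is an assumption of this sketch, not a documented Referee setting.
    """
    q = max(0, min(q, q_max))
    return chr(q + offset)


def low_quality_positions(quality_string: str, min_q: int = 20, offset: int = 33) -> list:
    """Return 0-based positions whose decoded score is below min_q."""
    return [i for i, ch in enumerate(quality_string) if ord(ch) - offset < min_q]


if __name__ == "__main__":
    q = phred_from_error_prob(0.001)   # error probability of 1 in 1000
    print(q, fastq_char(q))
    print(low_quality_positions("II!I5"))
```

A per-position quality string of this kind lets downstream tools mask or skip sites below a chosen threshold without re-running genotyping.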
Affiliation(s)
- Gregg W C Thomas
  - Department of Biology, Indiana University, Bloomington
  - Department of Computer Science, Indiana University, Bloomington
- Matthew W Hahn
  - Department of Biology, Indiana University, Bloomington
  - Department of Computer Science, Indiana University, Bloomington
|
15
|
Ocampo Daza D, Larhammar D. Evolution of the growth hormone, prolactin, prolactin 2 and somatolactin family. Gen Comp Endocrinol 2018; 264:94-112. [PMID: 29339183 DOI: 10.1016/j.ygcen.2018.01.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 08/08/2017] [Revised: 12/29/2017] [Accepted: 01/11/2018] [Indexed: 12/30/2022]
Abstract
Growth hormone (GH), prolactin (PRL), prolactin 2 (PRL2) and somatolactin (SL) belong to the same hormone family and have a wide repertoire of effects including development, osmoregulation, metabolism and stimulation of growth. Both the hormone and the receptor family have been proposed to have expanded by gene duplications in early vertebrate evolution. A key question is how hormone-receptor preferences have arisen among the duplicates. The first step to address this is to determine the time window for these duplications. Specifically, we aimed to see if duplications resulted from the two basal vertebrate tetraploidizations (1R and 2R). GH family genes from a broad range of vertebrate genomes were investigated using a combination of sequence-based phylogenetic analyses and comparisons of synteny. We conclude that the PRL and PRL2 genes arose from a common ancestor in 1R/2R, as shown by neighboring gene families. No other gene duplicates were preserved from these tetraploidization events. The ancestral genes that would give rise to GH and PRL/PRL2 arose from an earlier duplication; most likely a local gene duplication as they are syntenic in several species. Likewise, some evidence suggests that SL arose from a local duplication of an ancestral GH/SL gene in the same time window, explaining the lack of similarity in chromosomal neighbors to GH, PRL or PRL2. Thus, the basic triplet of ancestral GH, PRL/PRL2 and SL genes appear to be unexpectedly ancient. Following 1R/2R, only SL was duplicated in the teleost-specific tetraploidization 3R, resulting in SLa and SLb. These time windows contrast with our recent report that the corresponding receptor genes GHR and PRLR arose through a local duplication in jawed vertebrates and that both receptor genes duplicated further in 3R, which reveals a surprising asynchrony in hormone and receptor gene duplications.
Affiliation(s)
- Daniel Ocampo Daza
- Department of Neuroscience, Science for Life Laboratory, Uppsala University, Box 593, SE-75124 Uppsala, Sweden.
- Dan Larhammar
- Department of Neuroscience, Science for Life Laboratory, Uppsala University, Box 593, SE-75124 Uppsala, Sweden
|
16
|
Premzl M. Comparative genomic analysis of eutherian adiponectin genes. Heliyon 2018; 4:e00647. [PMID: 30003153 PMCID: PMC6040601 DOI: 10.1016/j.heliyon.2018.e00647] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 02/05/2018] [Revised: 05/16/2018] [Accepted: 06/01/2018] [Indexed: 12/28/2022] Open
Abstract
The present study proposed updated and standardized classification and nomenclature of eutherian adiponectin genes implicated in regulation of systemic metabolism and inflammation and activation of classical complement pathway. The revisions of comprehensive adiponectin gene data sets used eutherian comparative genomic analysis protocol and public reference genomic sequence assemblies. Among 438 potential coding sequences, the tests of reliability of eutherian public genomic sequences annotated most comprehensive curated third-party data gene data set of eutherian adiponectin genes that included 211 complete coding sequences. There were 18 major gene clusters of eutherian adiponectin genes described, one of which included evidence of differential gene expansions. For example, the present analysis initially described human ADIF2 and ADIR genes. Finally, the tests of protein molecular evolution using relative synonymous codon usage statistics confirmed protein primary structure similarities between eutherian adiponectins and tumor necrosis factor ligands.
|
17
|
Ocampo Daza D, Larhammar D. Evolution of the receptors for growth hormone, prolactin, erythropoietin and thrombopoietin in relation to the vertebrate tetraploidizations. Gen Comp Endocrinol 2018; 257:143-160. [PMID: 28652136 DOI: 10.1016/j.ygcen.2017.06.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 01/29/2017] [Revised: 06/16/2017] [Accepted: 06/22/2017] [Indexed: 12/19/2022]
Abstract
The receptors for the pituitary hormones growth hormone (GH), prolactin (PRL) and somatolactin (SL), and the hematopoietic hormones erythropoietin (EPO) and thrombopoietin (TPO), comprise a structurally related family in the superfamily of cytokine class-I receptors. GH, PRL and SL receptors have a wide variety of effects in development, osmoregulation, metabolism and stimulation of growth, while EPO and TPO receptors guide the production and differentiation of erythrocytes and thrombocytes, respectively. The evolution of the receptors for GH, PRL and SL has been partially investigated by previous reports suggesting different time points for the hormone and receptor gene duplications. This raises questions about how hormone-receptor partnerships have emerged and evolved. Therefore, we have investigated in detail the expansion of this receptor family, especially in relation to the basal vertebrate (1R, 2R) and teleost (3R) tetraploidizations. Receptor family genes were identified in a broad range of vertebrate genomes and investigated using a combination of sequence-based phylogenetic analyses and comparative genomic analyses of synteny. We found that 1R most likely generated EPOR/TPOR and GHR/PRLR ancestors; following this, 2R resulted in EPOR and TPOR genes. No GHR/PRLR duplicate seems to have survived after 2R. Instead the single GHR/PRLR underwent a local duplication sometime after 2R, generating separate syntenic genes for GHR and PRLR. Subsequently, 3R duplicated the gene pair in teleosts, resulting in two GHR and two PRLR genes, but no EPOR or TPOR duplicates. These analyses help illuminate the evolution of the regulatory mechanisms for somatic growth, metabolism, osmoregulation and hematopoiesis in vertebrates.
Affiliation(s)
- Daniel Ocampo Daza
- Department of Neuroscience, Science for Life Laboratory, Uppsala University, Box 593, SE-75124 Uppsala, Sweden.
- Dan Larhammar
- Department of Neuroscience, Science for Life Laboratory, Uppsala University, Box 593, SE-75124 Uppsala, Sweden
|
18
|
Gradnigo JS, Majumdar A, Norgren RB, Moriyama EN. Advantages of an Improved Rhesus Macaque Genome for Evolutionary Analyses. PLoS One 2016; 11:e0167376. [PMID: 27911958 PMCID: PMC5135103 DOI: 10.1371/journal.pone.0167376] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/01/2015] [Accepted: 11/14/2016] [Indexed: 01/12/2023] Open
Abstract
The rhesus macaque (Macaca mulatta) is widely used in molecular evolutionary analyses, particularly to identify genes under adaptive or unique evolution in the human lineage. For such studies, it is necessary to align nucleotide sequences of homologous protein-coding genes among multiple species. The validity of these analyses is dependent on high quality genomic data. However, for most mammalian species (other than humans and mice), only draft genomes are available. There has been concern that some results obtained from evolutionary analyses using draft genomes may not be correct. The rhesus macaque provides a unique opportunity to determine whether an improved genome (MacaM) yields better results than a draft genome (rheMac2) for evolutionary studies. We compared protein-coding genes annotated in the rheMac2 and MacaM genomes with their human orthologs. We found many genes annotated in rheMac2 had apparently spurious sequences not present in genes derived from MacaM. The rheMac2 annotations also appeared to inflate a frequently used evolutionary index, ω (the ratio of nonsynonymous to synonymous substitution rates). Genes with these spurious sequences must be filtered out from evolutionary analyses to obtain correct results. With the MacaM genome, improved sequence information means many more genes can be examined for indications of selection. These results indicate how upgrading genomes from draft status to a higher level of quality can improve interpretation of evolutionary patterns.
Affiliation(s)
- Julien S. Gradnigo
- School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
- Abhishek Majumdar
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, United States of America
- Robert B. Norgren
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, United States of America
- Etsuko N. Moriyama
- School of Biological Sciences and Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
|
19
|
Richard J, Kim ED, Nguyen H, Kim CD, Kim S. Allostery Wiring Map for Kinesin Energy Transduction and Its Evolution. J Biol Chem 2016; 291:20932-20945. [PMID: 27507814 PMCID: PMC5076506 DOI: 10.1074/jbc.m116.733675] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/21/2016] [Indexed: 12/28/2022] Open
Abstract
How signals between the kinesin active and cytoskeletal binding sites are transmitted is an open question and an allosteric question. By extracting correlated evolutionary changes within 700+ sequences, we built a model of residues that are energetically coupled and that define molecular routes for signal transmission. Typically, these coupled residues are located at multiple distal sites and thus are predicted to form a complex, non-linear network that wires together different functional sites in the protein. Of note, our model connected the site for ATP hydrolysis with sites that ultimately utilize its free energy, such as the microtubule-binding site, drug-binding loop 5, and necklinker. To confirm the calculated energetic connectivity between non-adjacent residues, double-mutant cycle analysis was conducted with 22 kinesin mutants. There was a direct correlation between thermodynamic coupling in experiment and evolutionarily derived energetic coupling. We conclude that energy transduction is coordinated by multiple distal sites in the protein rather than only being relayed through adjacent residues. Moreover, this allosteric map forecasts how energetic orchestration gives rise to different nanomotor behaviors within the superfamily.
Affiliation(s)
- Jessica Richard
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
- Elizabeth D Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
- Hoang Nguyen
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
- Catherine D Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
- Sunyoung Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
|
20
|
Vanhoutreve R, Kress A, Legrand B, Gass H, Poch O, Thompson JD. LEON-BIS: multiple alignment evaluation of sequence neighbours using a Bayesian inference system. BMC Bioinformatics 2016; 17:271. [PMID: 27387560 PMCID: PMC4936259 DOI: 10.1186/s12859-016-1146-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 03/29/2016] [Accepted: 07/01/2016] [Indexed: 11/13/2022] Open
Abstract
Background: A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. Results: Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including ‘core blocks’, ‘regions’ and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. Conclusions: LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.
Affiliation(s)
- Renaud Vanhoutreve
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Arnaud Kress
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Baptiste Legrand
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Hélène Gass
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
|
21
|
Lämke J, Brzezinka K, Bäurle I. HSFA2 orchestrates transcriptional dynamics after heat stress in Arabidopsis thaliana. Transcription 2016; 7:111-4. [PMID: 27383578 DOI: 10.1080/21541264.2016.1187550] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Indexed: 10/21/2022] Open
Abstract
In nature, stress is typically chronic or recurring and stress exposure can prime modified responses to recurring stress. Such stress priming may occur at the level of transcription. Here, we discuss the connection between plant stress memory, transcription, and chromatin modifications using the example of recurring heat stress.
Affiliation(s)
- Jörn Lämke
- Institute for Biochemistry and Biology, University of Potsdam, Potsdam, Germany
- Krzysztof Brzezinka
- Institute for Biochemistry and Biology, University of Potsdam, Potsdam, Germany
- Isabel Bäurle
- Institute for Biochemistry and Biology, University of Potsdam, Potsdam, Germany
|
22
|
Bianchetti L, Tarabay Y, Lecompte O, Stote R, Poch O, Dejaegere A, Viville S. Tex19 and Sectm1 concordant molecular phylogenies support co-evolution of both eutherian-specific genes. BMC Evol Biol 2015; 15:222. [PMID: 26459560 PMCID: PMC4603632 DOI: 10.1186/s12862-015-0506-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/02/2015] [Accepted: 10/01/2015] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND: Transposable elements (TE) have attracted much attention since they shape the genome and contribute to species evolution. Organisms have evolved mechanisms to control TE activity. Testis expressed 19 (Tex19) represses TE expression in mouse testis and placenta. In the human and mouse genomes, Tex19 and Secreted and transmembrane 1 (Sectm1) are neighbors but are not homologs. Sectm1 is involved in immunity and its molecular phylogeny is unknown. METHODS: Using multiple alignments of complete protein sequences (MACS), we inferred Tex19 and Sectm1 molecular phylogenies. Protein conserved regions were identified and folds were predicted. Finally, expression patterns were studied across tissues and species using RNA-seq public data and RT-PCR. RESULTS: We present 2 high quality alignments of 58 Tex19 and 58 Sectm1 protein sequences from 48 organisms. First, both genes are eutherian-specific, i.e., exclusively present in mammals except monotremes (platypus) and marsupials. Second, Tex19 and Sectm1 have both duplicated in Sciurognathi and Bovidae while they have remained as single copy genes in all further placental mammals. Phylogenetic concordance between both genes was significant (p-value < 0.05) and supported co-evolution and functional relationship. At the protein level, Tex19 exhibits 3 conserved regions and 4 invariant cysteines. In particular, a CXXC motif is present in the N-terminal conserved region. Sectm1 exhibits 2 invariant cysteines and an Ig-like domain. Strikingly, Tex19 C-terminal conserved region was lost in Haplorrhini primates while a Sectm1 C-terminal extra domain was acquired. Finally, we have determined that Tex19 and Sectm1 expression levels anti-correlate across the testis of several primates (ρ = -0.72) which supports anti-regulation. CONCLUSIONS: Tex19 and Sectm1 co-evolution and anti-regulated expressions support a strong functional relationship between both genes. Since Tex19 operates a control on TE and Sectm1 plays a role in immunity, Tex19 might suppress an immune response directed against cells that show TE activity in eutherian reproductive tissues.
Affiliation(s)
- Laurent Bianchetti
- Biocomputing and Molecular Modelling Laboratory, Integrated Structural Biology Department, Genetics Institute of Molecular and Cellular Biology (IGBMC), INSERM U964/CNRS UMR 1704/Strasbourg University, 1 rue Laurent Fries, 67404, Illkirch, France.
- Yara Tarabay
- Primordial Germ Cells' Ontogeny and Pluripotency Laboratory, Functional Genomics and Cancer Department, Genetics Institute of Molecular and Cellular Biology (IGBMC), INSERM U964/CNRS UMR 1704/Université de Strasbourg, 1 rue Laurent Fries, 67404, Illkirch, France. Present address: Institut de génétique humaine (IGH), 141 rue de la Cardonille, 34396, Montpellier, France.
- Odile Lecompte
- Bioinformatics and Integrated Genomics Laboratory (LBGI), ICube, CNRS UMR 7357/Université de Strasbourg, 11 rue Humann, 67085, Strasbourg, France.
- Roland Stote
- Biocomputing and Molecular Modelling Laboratory, Integrated Structural Biology Department, Genetics Institute of Molecular and Cellular Biology (IGBMC), INSERM U964/CNRS UMR 1704/Strasbourg University, 1 rue Laurent Fries, 67404, Illkirch, France.
- Olivier Poch
- Bioinformatics and Integrated Genomics Laboratory (LBGI), ICube, CNRS UMR 7357/Université de Strasbourg, 11 rue Humann, 67085, Strasbourg, France.
- Annick Dejaegere
- Biocomputing and Molecular Modelling Laboratory, Integrated Structural Biology Department, Genetics Institute of Molecular and Cellular Biology (IGBMC), INSERM U964/CNRS UMR 1704/Strasbourg University, 1 rue Laurent Fries, 67404, Illkirch, France.
- Stéphane Viville
- Primordial Germ Cells' Ontogeny and Pluripotency Laboratory, Functional Genomics and Cancer Department, Genetics Institute of Molecular and Cellular Biology (IGBMC), INSERM U964/CNRS UMR 1704/Université de Strasbourg, 1 rue Laurent Fries, 67404, Illkirch, France; Centre Hospitalier Universitaire, 67000, Strasbourg, France.
|
23
|
Khenoussi W, Vanhoutrève R, Poch O, Thompson JD. SIBIS: a Bayesian model for inconsistent protein sequence estimation. Bioinformatics 2014; 30:2432-9. [PMID: 24825613 DOI: 10.1093/bioinformatics/btu329] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS: We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION: Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
Affiliation(s)
- Walyd Khenoussi
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Renaud Vanhoutrève
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
|
24
|
Nagy A, Patthy L. FixPred: a resource for correction of erroneous protein sequences. Database: The Journal of Biological Databases and Curation 2014; 2014:bau032. [PMID: 24705206 PMCID: PMC3975993 DOI: 10.1093/database/bau032] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource that serves to identify erroneous sequences; here we present the FixPred computational pipeline that automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional Metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com
Affiliation(s)
- László Patthy
- *Corresponding author: Tel: +361 279 3100; Fax: +361 466 5465;
|
25
|
Nagy A, Patthy L. MisPred: a resource for identification of erroneous protein sequences in public databases. Database: The Journal of Biological Databases and Curation 2013; 2013:bat053. [PMID: 23864220 PMCID: PMC3713709 DOI: 10.1093/database/bat053] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.
Affiliation(s)
- Alinda Nagy
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1113 Budapest, Hungary
|
26
|
Budd A, Devos DP. Evaluating the Evolutionary Origins of Unexpected Character Distributions within the Bacterial Planctomycetes-Verrucomicrobia-Chlamydiae Superphylum. Front Microbiol 2012; 3:401. [PMID: 23189077 PMCID: PMC3505017 DOI: 10.3389/fmicb.2012.00401] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/15/2012] [Accepted: 10/31/2012] [Indexed: 12/26/2022] Open
Abstract
Recently, several characters that are absent from most bacteria, but which are found in many eukaryotes or archaea, have been identified within the bacterial Planctomycetes-Verrucomicrobia-Chlamydiae (PVC) superphylum. Hypotheses of the evolutionary history of such characters are commonly based on the inference of phylogenies of gene or protein families associated with the traits, estimated from multiple sequence alignments (MSAs). So far, studies of this kind have focused on the distribution of (i) two genes involved in the synthesis of sterol, (ii) tubulin genes, and (iii) c1 transfer genes. In many cases, these analyses have concluded that horizontal gene transfer (HGT) is likely to have played a role in shaping the taxonomic distribution of these gene families. In this article, we describe several issues with the inference of HGT from such analyses, in particular concerning the considerable uncertainty associated with our estimation of both gene family phylogenies (especially those containing ancient lineage divergences) and the Tree of Life (ToL), and the need for wider use and further development of explicit probabilistic models to compare hypotheses of vertical and horizontal genetic transmission. We suggest that data which is often taken as evidence for the occurrence of ancient HGT events may not be as convincing as is commonly described, and consideration of alternative theories is recommended. While focusing on analyses including PVCs, this discussion is also relevant for inferences of HGT involving other groups of organisms.
Affiliation(s)
- A. Budd
- European Molecular Biology Laboratory, Heidelberg, Germany
- D. P. Devos
- European Molecular Biology Laboratory, Heidelberg, Germany
|
27
|
Guo B, Zou M, Wagner A. Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication. Mol Biol Evol 2012; 29:3005-22. [PMID: 22490820 DOI: 10.1093/molbev/mss108] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 01/18/2023] Open
Abstract
Insertions and deletions (indels) in protein-coding genes are important sources of genetic variation. Their role in creating new proteins may be especially important after gene duplication. However, little is known about how indels affect the divergence of duplicate genes. We here study thousands of duplicate genes in five fish (teleost) species with completely sequenced genomes. The ancestor of these species has been subject to a fish-specific genome duplication (FSGD) event that occurred approximately 350 Ma. We find that duplicate genes contain at least 25% more indels than single-copy genes. These indels accumulated preferentially in the first 40 my after the FSGD. A lack of widespread asymmetric indel accumulation indicates that both members of a duplicate gene pair typically experience relaxed selection. Strikingly, we observe a 30-80% excess of deletions over insertions that is consistent for indels of various lengths and across the five genomes. We also find that indels preferentially accumulate inside loop regions of protein secondary structure and in regions where amino acids are exposed to solvent. We show that duplicate genes with high indel density also show high DNA sequence divergence. Indel density, but not amino acid divergence, can explain a large proportion of the tertiary structure divergence between proteins encoded by duplicate genes. Our observations are consistent across all five fish species. Taken together, they suggest a general pattern of duplicate gene evolution in which indels are important driving forces of evolutionary change.
Affiliation(s)
- Baocheng Guo
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
|