1
|
Rossier V, Warwick Vesztrocy A, Robinson-Rechavi M, Dessimoz C. OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches. Bioinformatics 2021; 37:2866-2873. [PMID: 33787851 PMCID: PMC8479680 DOI: 10.1093/bioinformatics/btab219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2020] [Revised: 02/18/2021] [Accepted: 03/30/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive. RESULTS Here, we first show that in multiple animal and plant datasets, 18-62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. AVAILABILITYAND IMPLEMENTATION OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Victor Rossier
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Alex Warwick Vesztrocy
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Marc Robinson-Rechavi
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland,Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland,To whom correspondence should be addressed. or
| | - Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland,Department of Genetics, Evolution, and Environment, University College London, London, WC1E 6BT, UK,Department of Computer Science, University College London, London, WC1E 6BT, UK,To whom correspondence should be addressed. or
| |
Collapse
|
2
|
Deutekom ES, Snel B, van Dam TJP. Benchmarking orthology methods using phylogenetic patterns defined at the base of Eukaryotes. Brief Bioinform 2020; 22:5906198. [PMID: 32935832 PMCID: PMC8138875 DOI: 10.1093/bib/bbaa206] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2020] [Revised: 08/10/2020] [Accepted: 08/11/2020] [Indexed: 12/26/2022] Open
Abstract
Insights into the evolution of ancestral complexes and pathways are generally achieved through careful and time-intensive manual analysis often using phylogenetic profiles of the constituent proteins. This manual analysis limits the possibility of including more protein-complex components, repeating the analyses for updated genome sets or expanding the analyses to larger scales. Automated orthology inference should allow such large-scale analyses, but substantial differences between orthologous groups generated by different approaches are observed. We evaluate orthology methods for their ability to recapitulate a number of observations that have been made with regard to genome evolution in eukaryotes. Specifically, we investigate phylogenetic profile similarity (co-occurrence of complexes), the last eukaryotic common ancestor’s gene content, pervasiveness of gene loss and the overlap with manually determined orthologous groups. Moreover, we compare the inferred orthologies to each other. We find that most orthology methods reconstruct a large last eukaryotic common ancestor, with substantial gene loss, and can predict interacting proteins reasonably well when applying phylogenetic co-occurrence. At the same time, derived orthologous groups show imperfect overlap with manually curated orthologous groups. There is no strong indication of which orthology method performs better than another on individual or all of these aspects. Counterintuitively, despite the orthology methods behaving similarly regarding large-scale evaluation, the obtained orthologous groups differ vastly from one another. Availability and implementation The data and code underlying this article are available in github and/or upon reasonable request to the corresponding author: https://github.com/ESDeutekom/ComparingOrthologies.
Collapse
Affiliation(s)
| | - Berend Snel
- Corresponding author: Berend Snel, Padualaan 8, 358CH Utrecht, The Netherlands. Tel.: +31(0)30 253 8102; E-mail:
| | | |
Collapse
|
3
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Collapse
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Sagi Snir
- Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| |
Collapse
|
4
|
Piližota I, Train CM, Altenhoff A, Redestig H, Dessimoz C. Phylogenetic approaches to identifying fragments of the same gene, with application to the wheat genome. Bioinformatics 2019; 35:1159-1166. [PMID: 30184069 PMCID: PMC6449756 DOI: 10.1093/bioinformatics/bty772] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2017] [Revised: 07/30/2018] [Accepted: 08/31/2018] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION As the time and cost of sequencing decrease, the number of available genomes and transcriptomes rapidly increases. Yet the quality of the assemblies and the gene annotations varies considerably and often remains poor, affecting downstream analyses. This is particularly true when fragments of the same gene are annotated as distinct genes, which may cause them to be mistaken as paralogs. RESULTS In this study, we introduce two novel phylogenetic tests to infer non-overlapping or partially overlapping genes that are in fact parts of the same gene. One approach collapses branches with low bootstrap support and the other computes a likelihood ratio test. We extensively validated these methods by (i) introducing and recovering fragmentation on the bread wheat, Triticum aestivum cv. Chinese Spring, chromosome 3B; (ii) by applying the methods to the low-quality 3B assembly and validating predictions against the high-quality 3B assembly; and (iii) by comparing the performance of the proposed methods to the performance of existing methods, namely Ensembl Compara and ESPRIT. Application of this combination to a draft shotgun assembly of the entire bread wheat genome revealed 1221 pairs of genes that are highly likely to be fragments of the same gene. Our approach demonstrates the power of fine-grained evolutionary inferences across multiple species to improving genome assemblies and annotations. AVAILABILITY AND IMPLEMENTATION An open source software tool is available at https://github.com/DessimozLab/esprit2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ivana Piližota
- Department of Genetics Evolution & Environment, University College London, UK.,Department of Computer Science, University College London, UK
| | - Clément-Marie Train
- Department of Computational Biology, Lausanne, Switzerland.,Center for Integrative Genomics University of Lausanne, Lausanne, Switzerland.,Swiss Institute of Bioinformatics, Biophore Building, Lausanne, Switzerland
| | - Adrian Altenhoff
- Swiss Institute of Bioinformatics, Biophore Building, Lausanne, Switzerland
| | | | - Christophe Dessimoz
- Department of Genetics Evolution & Environment, University College London, UK.,Department of Computer Science, University College London, UK.,Department of Computational Biology, Lausanne, Switzerland.,Center for Integrative Genomics University of Lausanne, Lausanne, Switzerland.,Swiss Institute of Bioinformatics, Biophore Building, Lausanne, Switzerland
| |
Collapse
|
5
|
Abstract
The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplication, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differences among the various orthology inference methods and databases and examine the difficult issue of verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.
Collapse
|
6
|
Pouchon C, Fernández A, Nassar JM, Boyer F, Aubert S, Lavergne S, Mavárez J. Phylogenomic Analysis of the Explosive Adaptive Radiation of the Espeletia Complex (Asteraceae) in the Tropical Andes. Syst Biol 2018; 67:1041-1060. [PMID: 30339252 DOI: 10.1093/sysbio/syy022] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2017] [Accepted: 03/15/2018] [Indexed: 01/17/2023] Open
Abstract
The subtribe Espeletiinae (Asteraceae), endemic to the high-elevations in the Northern Andes, exhibits an exceptional diversity of species, growth-forms, and reproductive strategies. This complex of 140 species includes large trees, dichotomous trees, shrubs and the extraordinary giant caulescent rosettes, considered as a classic example of adaptation in tropical high-elevation ecosystems. The subtribe has also long been recognized as a prominent case of adaptive radiation, but the understanding of its evolution has been hampered by a lack of phylogenetic resolution. Herein, we produce the first fully resolved phylogeny of all morphological groups of Espeletiinae, using whole plastomes and about a million nuclear nucleotides obtained with an original de novo assembly procedure without reference genome, and analyzed with traditional and coalescent-based approaches that consider the possible impact of incomplete lineage sorting and hybridization on phylogenetic inference. We show that the diversification of Espeletiinae started from a rosette ancestor about 2.3 Ma, after the final uplift of the Northern Andes. This was followed by two independent radiations in the Colombian and Venezuelan Andes, with a few trans-cordilleran dispersal events among low-elevation tree lineages but none among high-elevation rosettes. We demonstrate complex scenarios of morphological change in Espeletiinae, usually implying the convergent evolution of growth-forms with frequent loss/gains of various traits. For instance, caulescent rosettes evolved independently in both countries, likely as convergent adaptations to life in tropical high-elevation habitats. Tree growth-forms evolved independently three times from the repeated colonization of lower elevations by high-elevation rosette ancestors. The rate of morphological diversification increased during the early phase of the radiation, after which it decreased steadily towards the present. On the other hand, the rate of species diversification in the best-sampled Venezuelan radiation was on average very high (3.1 spp/My), with significant rate variation among growth-forms (much higher in polycarpic caulescent rosettes). Our results point out a scenario where both adaptive morphological evolution and geographical isolation due to Pleistocene climatic oscillations triggered an exceptionally rapid radiation for a continental plant group.
Collapse
Affiliation(s)
- Charles Pouchon
- Laboratoire d'Ecologie Alpine, UMR 5553, Université Grenoble Alpes-CNRS, Grenoble, France
| | - Angel Fernández
- Herbario IVIC, Centro de Biofísica y Bioquímica, Instituto Venezolano de Investigaciones Científicas, Apartado 20632, Caracas 1020-A, Venezuela
| | - Jafet M Nassar
- Laboratorio de Biología de Organismos, Centro de Ecología, Instituto Venezolano de Investigaciones Científicas, Apartado 20632, Caracas 1020-A, Venezuela
| | - Frédéric Boyer
- Laboratoire d'Ecologie Alpine, UMR 5553, Université Grenoble Alpes-CNRS, Grenoble, France
| | - Serge Aubert
- Laboratoire d'Ecologie Alpine, UMR 5553, Université Grenoble Alpes-CNRS, Grenoble, France.,Station alpine Joseph-Fourier, UMS 3370, Université Grenoble Alpes-CNRS, Grenoble, France
| | - Sébastien Lavergne
- Laboratoire d'Ecologie Alpine, UMR 5553, Université Grenoble Alpes-CNRS, Grenoble, France
| | - Jesús Mavárez
- Laboratoire d'Ecologie Alpine, UMR 5553, Université Grenoble Alpes-CNRS, Grenoble, France
| |
Collapse
|
7
|
Smith SA, Pease JB. Heterogeneous molecular processes among the causes of how sequence similarity scores can fail to recapitulate phylogeny. Brief Bioinform 2017; 18:451-457. [PMID: 27103098 PMCID: PMC5429007 DOI: 10.1093/bib/bbw034] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2016] [Indexed: 11/24/2022] Open
Abstract
Sequence similarity tools like Basic Local Alignment Search Tool (BLAST) are essential components of many functional genetic, genomic, phylogenetic and bioinformatic studies. Many modern analysis pipelines use significant sequence similarity scores (p- or E-values) and the ranked order of BLAST matches to test a wide range of hypotheses concerning homology, orthology, the timing of de novo gene birth/death and gene family expansion/contraction. Despite significant contrary findings, many of these tests still implicitly assume that stronger or higher-ranked E-value scores imply closer phylogenetic relationships between sequences. Here, we demonstrate that even though a general relationship does exist between the phylogenetic distance of two sequences and their E-value, significant and misleading errors occur in both the completeness and the order of results under realistic evolutionary scenarios. These results provide additional details to past evidence showing that studies should avoid drawing direct inferences of evolutionary relatedness from measures of sequence similarity alone, and should instead, where possible, use more rigorous phylogeny-based methods.
Collapse
Affiliation(s)
- Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
- Corresponding author: Stephen A. Smith, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan 48109, USA. E-mail:
| | - James B Pease
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| |
Collapse
|
8
|
Jahangiri-Tazehkand S, Wong L, Eslahchi C. OrthoGNC: A Software for Accurate Identification of Orthologs Based on Gene Neighborhood Conservation. GENOMICS PROTEOMICS & BIOINFORMATICS 2017; 15:361-370. [PMID: 29133277 PMCID: PMC5828658 DOI: 10.1016/j.gpb.2017.07.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/22/2017] [Revised: 07/17/2017] [Accepted: 07/28/2017] [Indexed: 11/17/2022]
Abstract
Orthology relations can be used to transfer annotations from one gene (or protein) to another. Hence, detecting orthology relations has become an important task in the post-genomic era. Various genomic events, such as duplication and horizontal gene transfer, can cause erroneous assignment of orthology relations. In closely-related species, gene neighborhood information can be used to resolve many ambiguities in orthology inference. Here we present OrthoGNC, a software for accurately predicting pairwise orthology relations based on gene neighborhood conservation. Analyses on simulated and real data reveal the high accuracy of OrthoGNC. In addition to orthology detection, OrthoGNC can be employed to investigate the conservation of genomic context among potential orthologs detected by other methods. OrthoGNC is freely available online at http://bs.ipm.ir/softwares/orthognc and http://tinyurl.com/orthoGNC.
Collapse
Affiliation(s)
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Changiz Eslahchi
- Department of Computer Science, Shahid Beheshti University, Tehran 1983969411, Iran.
| |
Collapse
|
9
|
SMORE: Synteny Modulator of Repetitive Elements. Life (Basel) 2017; 7:life7040042. [PMID: 29088079 PMCID: PMC5745555 DOI: 10.3390/life7040042] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2017] [Revised: 10/27/2017] [Accepted: 10/28/2017] [Indexed: 12/19/2022] Open
Abstract
Several families of multicopy genes, such as transfer ribonucleic acids (tRNAs) and ribosomal RNAs (rRNAs), are subject to concerted evolution, an effect that keeps sequences of paralogous genes effectively identical. Under these circumstances, it is impossible to distinguish orthologs from paralogs on the basis of sequence similarity alone. Synteny, the preservation of relative genomic locations, however, also remains informative for the disambiguation of evolutionary relationships in this situation. In this contribution, we describe an automatic pipeline for the evolutionary analysis of such cases that use genome-wide alignments as a starting point to assign orthology relationships determined by synteny. The evolution of tRNAs in primates as well as the history of the Y RNA family in vertebrates and nematodes are used to showcase the method. The pipeline is freely available.
Collapse
|
10
|
Rane RV, Oakeshott JG, Nguyen T, Hoffmann AA, Lee SF. Orthonome - a new pipeline for predicting high quality orthologue gene sets applicable to complete and draft genomes. BMC Genomics 2017; 18:673. [PMID: 28859620 PMCID: PMC5580312 DOI: 10.1186/s12864-017-4079-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2017] [Accepted: 08/21/2017] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Distinguishing orthologous and paralogous relationships between genes across multiple species is essential for comparative genomic analyses. Various computational approaches have been developed to resolve these evolutionary relationships, but strong trade-offs between precision and recall of orthologue prediction remains an ongoing challenge. RESULTS Here we present Orthonome, an orthologue prediction pipeline, designed to reduce the trade-off between orthologue capture rates (recall) and accuracy of multi-species orthologue prediction. The pipeline compares sequence domains and then forms sequence-similar clusters before using phylogenetic comparisons to identify inparalogues. It then corrects sequence similarity metrics for fragment and gene length bias using a novel scoring metric capturing relationships between full length as well as fragmented genes. The remaining genes are then brought together for the identification of orthologues within a phylogenetic framework. The orthologue predictions are further calibrated along with inparalogues and gene births, using synteny, to identify novel orthologous relationships. We use 12 high quality Drosophila genomes to show that, compared to other orthologue prediction pipelines, Orthonome provides orthogroups with minimal error but high recall. Furthermore, Orthonome is resilient to suboptimal assembly/annotation quality, with the inclusion of draft genomes from eight additional Drosophila species still providing >6500 1:1 orthologues across all twenty species while retaining a better combination of accuracy and recall than other pipelines. Orthonome is implemented as a searchable database and query tool along with multiple-sequence alignment browsers for all sets of orthologues. The underlying documentation and database are accessible at http://www.orthonome.com . CONCLUSION We demonstrate that Orthonome provides a superior combination of orthologue capture rates and accuracy on complete and draft drosophilid genomes when tested alongside previously published pipelines. The study also highlights a greater degree of evolutionary conservation across drosophilid species than earlier thought.
Collapse
Affiliation(s)
- Rahul V Rane
- Bio21 Institute, School of Biosciences, The University of Melbourne, Melbourne, Victoria, Australia. .,CSIRO, Canberra, Australian Capital Territory, Australia.
| | | | - Thu Nguyen
- Bio21 Institute, School of Biosciences, The University of Melbourne, Melbourne, Victoria, Australia
| | - Ary A Hoffmann
- Bio21 Institute, School of Biosciences, The University of Melbourne, Melbourne, Victoria, Australia
| | - Siu F Lee
- CSIRO, Canberra, Australian Capital Territory, Australia.,Department of Biological Sciences, Macquarie University, Sydney, New South Wales, Australia
| |
Collapse
|
11
|
Velandia-Huerto CA, Berkemer SJ, Hoffmann A, Retzlaff N, Romero Marroquín LC, Hernández-Rosales M, Stadler PF, Bermúdez-Santana CI. Orthologs, turn-over, and remolding of tRNAs in primates and fruit flies. BMC Genomics 2016; 17:617. [PMID: 27515907 PMCID: PMC4981973 DOI: 10.1186/s12864-016-2927-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2016] [Accepted: 07/11/2016] [Indexed: 12/26/2022] Open
Abstract
Background Transfer RNAs (tRNAs) are ubiquitous in all living organism. They implement the genetic code so that most genomes contain distinct tRNAs for almost all 61 codons. They behave similar to mobile elements and proliferate in genomes spawning both local and non-local copies. Most tRNA families are therefore typically present as multicopy genes. The members of the individual tRNA families evolve under concerted or rapid birth-death evolution, so that paralogous copies maintain almost identical sequences over long evolutionary time-scales. To a good approximation these are functionally equivalent. Individual tRNA copies thus are evolutionary unstable and easily turn into pseudogenes and disappear. This leads to a rapid turnover of tRNAs and often large differences in the tRNA complements of closely related species. Since tRNA paralogs are not distinguished by sequence, common methods cannot not be used to establish orthology between tRNA genes. Results In this contribution we introduce a general framework to distinguish orthologs and paralogs in gene families that are subject to concerted evolution. It is based on the use of uniquely aligned adjacent sequence elements as anchors to establish syntenic conservation of sequence intervals. In practice, anchors and intervals can be extracted from genome-wide multiple sequence alignments. Syntenic clusters of concertedly evolving genes of different families can then be subdivided by list alignments, leading to usually small clusters of candidate co-orthologs. On the basis of recent advances in phylogenetic combinatorics, these candidate clusters can be further processed by cograph editing to recover their duplication histories. We developed a workflow that can be conceptualized as stepwise refinement of a graph of homologous genes. We apply this analysis strategy with different types of synteny anchors to investigate the evolution of tRNAs in primates and fruit flies. We identified a large number of tRNA remolding events concentrated at the tips of the phylogeny. With one notable exception all phylogenetically old tRNA remoldings do not change the isoacceptor class. Conclusions Gene families evolving under concerted evolution are not amenable to classical phylogenetic analyses since paralogs maintain identical, species-specific sequences, precluding the estimation of correct gene trees from sequence differences. This leaves conservation of syntenic arrangements with respect to “anchor elements” that are not subject to concerted evolution as the only viable source of phylogenetic information. We have demonstrated here that a purely synteny-based analysis of tRNA gene histories is indeed feasible. Although the choice of synteny anchors influences the resolution in particular when tight gene clusters are present, and the quality of sequence alignments, genome assemblies, and genome rearrangements limits the scope of the analysis, largely coherent results can be obtained for tRNAs. In particular, we conclude that a large fraction of the tRNAs are recent copies. This proliferation is compensated by rapid pseudogenization as exemplified by many very recent alloacceptor remoldings. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2927-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Cristian A Velandia-Huerto
- Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá, D.C, Colombia
| | - Sarah J Berkemer
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103, Germany.,Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18D-04107, Leipzig, Germany
| | - Anne Hoffmann
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18D-04107, Leipzig, Germany
| | - Nancy Retzlaff
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103, Germany.,Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18D-04107, Leipzig, Germany
| | - Liliana C Romero Marroquín
- Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá, D.C, Colombia
| | - Maribel Hernández-Rosales
- CONACYT - Instituto de Matemáticas, UNAM Juriquilla, Av. Juriquilla #3001, Santiago de Querétaro, MX-76230, QRO, México
| | - Peter F Stadler
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103, Germany. .,Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18D-04107, Leipzig, Germany. .,Fraunhofer Institut for Cell Therapy and Immunology, Perlickstraße 1, Leipzig, D-04103, Germany. .,Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090, Austria. .,Center for non-coding RNA in Technology and Health, Grønegårdsvej 3, Frederiksberg C, DK-1870, Denmark. .,Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM87501, USA.
| | - Clara I Bermúdez-Santana
- Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá, D.C, Colombia
| |
Collapse
|
12
|
Comparing the Statistical Fate of Paralogous and Orthologous Sequences. Genetics 2016; 204:475-482. [PMID: 27474728 DOI: 10.1534/genetics.116.193912] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Accepted: 07/26/2016] [Indexed: 02/01/2023] Open
Abstract
For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to get insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model we can estimate the probability to find two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent [Formula: see text] Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.
Collapse
|
13
|
Ješovnik A, González VL, Schultz TR. Phylogenomics and Divergence Dating of Fungus-Farming Ants (Hymenoptera: Formicidae) of the Genera Sericomyrmex and Apterostigma. PLoS One 2016; 11:e0151059. [PMID: 27466804 PMCID: PMC4965065 DOI: 10.1371/journal.pone.0151059] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2015] [Accepted: 02/22/2016] [Indexed: 01/27/2023] Open
Abstract
Fungus-farming ("attine") ants are model systems for studies of symbiosis, coevolution, and advanced eusociality. A New World clade of nearly 300 species in 15 genera, all attine ants cultivate fungal symbionts for food. In order to better understand the evolution of ant agriculture, we sequenced, assembled, and analyzed transcriptomes of four different attine ant species in two genera: three species in the higher-attine genus Sericomyrmex and a single lower-attine ant species, Apterostigma megacephala, representing the first genomic data for either genus. These data were combined with published genomes of nine other ant species and the honey bee Apis mellifera for phylogenomic and divergence-dating analyses. The resulting phylogeny confirms relationships inferred in previous studies of fungus-farming ants. Divergence-dating analyses recovered slightly older dates than most prior analyses, estimating that attine ants originated 53.6–66.7 million of years ago, and recovered a very long branch subtending a very recent, rapid radiation of the genus Sericomyrmex. This result is further confirmed by a separate analysis of the three Sericomyrmex species, which reveals that 92.71% of orthologs have 99% - 100% pairwise-identical nucleotide sequences. We searched the transcriptomes for genes of interest, most importantly argininosuccinate synthase and argininosuccinate lyase, which are functional in other ants but which are known to have been lost in seven previously studied attine ant species. Loss of the ability to produce the amino acid arginine has been hypothesized to contribute to the obligate dependence of attine ants upon their cultivated fungi, but the point in fungus-farming ant evolution at which these losses occurred has remained unknown. We did not find these genes in any of the sequenced transcriptomes. Although expected for Sericomyrmex species, the absence of arginine anabolic genes in the lower-attine ant Apterostigma megacephala strongly suggests that the loss coincided with the origin of attine ants.
Collapse
Affiliation(s)
- Ana Ješovnik
- Entomology Department, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, United States of America
- Maryland Center for Systematic Entomology, Department of Entomology, University of Maryland, College Park, Maryland, United States of America
- * E-mail: (TRS); (AJ)
| | - Vanessa L. González
- Global Genome Initiative, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, United States of America
| | - Ted R. Schultz
- Entomology Department, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, United States of America
- * E-mail: (TRS); (AJ)
| |
Collapse
|
14
|
Glover NM, Redestig H, Dessimoz C. Homoeologs: What Are They and How Do We Infer Them? TRENDS IN PLANT SCIENCE 2016; 21:609-621. [PMID: 27021699 PMCID: PMC4920642 DOI: 10.1016/j.tplants.2016.02.005] [Citation(s) in RCA: 102] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/22/2015] [Revised: 02/09/2016] [Accepted: 02/20/2016] [Indexed: 05/18/2023]
Abstract
The evolutionary history of nearly all flowering plants includes a polyploidization event. Homologous genes resulting from allopolyploidy are commonly referred to as 'homoeologs', although this term has not always been used precisely or consistently in the literature. With several allopolyploid genome sequencing projects under way, there is a pressing need for computational methods for homoeology inference. Here we review the definition of homoeology in historical and modern contexts and propose a precise and testable definition highlighting the connection between homoeologs and orthologs. In the second part, we survey experimental and computational methods of homoeolog inference, considering the strengths and limitations of each approach. Establishing a precise and evolutionarily meaningful definition of homoeology is essential for understanding the evolutionary consequences of polyploidization.
Collapse
Affiliation(s)
- Natasha M Glover
- Bayer CropScience NV, Technologiepark 38, 9052 Gent, Belgium; University College London, Gower Street, London WC1E 6BT, UK
| | | | - Christophe Dessimoz
- University College London, Gower Street, London WC1E 6BT, UK; University of Lausanne, Biophore, 1015 Lausanne, Switzerland; Swiss Institute of Bioinformatics, Biophore, 1015 Lausanne, Switzerland.
| |
Collapse
|
15
|
Standardized benchmarking in the quest for orthologs. Nat Methods 2016; 13:425-30. [PMID: 27043882 PMCID: PMC4827703 DOI: 10.1038/nmeth.3830] [Citation(s) in RCA: 126] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2016] [Accepted: 03/09/2016] [Indexed: 11/23/2022]
Abstract
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision–recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
Collapse
|
16
|
Tekaia F. Inferring Orthologs: Open Questions and Perspectives. GENOMICS INSIGHTS 2016; 9:17-28. [PMID: 26966373 PMCID: PMC4778853 DOI: 10.4137/gei.s37925] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/18/2015] [Revised: 12/30/2015] [Accepted: 01/02/2016] [Indexed: 01/25/2023]
Abstract
With the increasing number of sequenced genomes and their comparisons, the detection of orthologs is crucial for reliable functional annotation and evolutionary analyses of genes and species. Yet, the dynamic remodeling of genome content through gain, loss, transfer of genes, and segmental and whole-genome duplication hinders reliable orthology detection. Moreover, the lack of direct functional evidence and the questionable quality of some available genome sequences and annotations present additional difficulties to assess orthology. This article reviews the existing computational methods and their potential accuracy in the high-throughput era of genome sequencing and anticipates open questions in terms of methodology, reliability, and computation. Appropriate taxon sampling together with combination of methods based on similarity, phylogeny, synteny, and evolutionary knowledge that may help detecting speciation events appears to be the most accurate strategy. This review also raises perspectives on the potential determination of orthology throughout the whole species phylogeny.
Collapse
Affiliation(s)
- Fredj Tekaia
- Institut Pasteur, Unit of Structural Microbiology, CNRS URA 3528 and University Paris Diderot, Sorbonne Paris Cité, Paris, France
| |
Collapse
|
17
|
Schierwater B, Holland PWH, Miller DJ, Stadler PF, Wiegmann BM, Wörheide G, Wray GA, DeSalle R. Never Ending Analysis of a Century Old Evolutionary Debate: “Unringing” the Urmetazoon Bell. Front Ecol Evol 2016. [DOI: 10.3389/fevo.2016.00005] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
18
|
Hellmuth M, Wieseke N. From Sequence Data Including Orthologs, Paralogs, and Xenologs to Gene and Species Trees. Evol Biol 2016. [DOI: 10.1007/978-3-319-41324-2_21] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
19
|
Kotowski N, Jardim R, Dávila AMR. Improved orthologous databases to ease protozoan targets inference. Parasit Vectors 2015; 8:494. [PMID: 26416523 PMCID: PMC4587786 DOI: 10.1186/s13071-015-1090-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2015] [Accepted: 09/11/2015] [Indexed: 11/10/2022] Open
Abstract
Background Homology inference helps on identifying similarities, as well as differences among organisms, which provides a better insight on how closely related one might be to another. In addition, comparative genomics pipelines are widely adopted tools designed using different bioinformatics applications and algorithms. In this article, we propose a methodology to build improved orthologous databases with the potential to aid on protozoan target identification, one of the many tasks which benefit from comparative genomics tools. Methods Our analyses are based on OrthoSearch, a comparative genomics pipeline originally designed to infer orthologs through protein-profile comparison, supported by an HMM, reciprocal best hits based approach. Our methodology allows OrthoSearch to confront two orthologous databases and to generate an improved new one. Such can be later used to infer potential protozoan targets through a similarity analysis against the human genome. Results The protein sequences of Cryptosporidium hominis, Entamoeba histolytica and Leishmania infantum genomes were comparatively analyzed against three orthologous databases: (i) EggNOG KOG, (ii) ProtozoaDB and (iii) Kegg Orthology (KO). That allowed us to create two new orthologous databases, “KO + EggNOG KOG” and “KO + EggNOG KOG + ProtozoaDB”, with 16,938 and 27,701 orthologous groups, respectively. Such new orthologous databases were used for a regular OrthoSearch run. By confronting “KO + EggNOG KOG” and “KO + EggNOG KOG + ProtozoaDB” databases and protozoan species we were able to detect the following total of orthologous groups and coverage (relation between the inferred orthologous groups and the species total number of proteins): Cryptosporidium hominis: 1,821 (11 %) and 3,254 (12 %); Entamoeba histolytica: 2,245 (13 %) and 5,305 (19 %); Leishmania infantum: 2,702 (16 %) and 4,760 (17 %). Using our HMM-based methodology and the largest created orthologous database, it was possible to infer 13 orthologous groups which represent potential protozoan targets; these were found because of our distant homology approach. We also provide the number of species-specific, pair-to-pair and core groups from such analyses, depicted in Venn diagrams. Conclusions The orthologous databases generated by our HMM-based methodology provide a broader dataset, with larger amounts of orthologous groups when compared to the original databases used as input. Those may be used for several homology inference analyses, annotation tasks and protozoan targets identification. Electronic supplementary material The online version of this article (doi:10.1186/s13071-015-1090-0) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Nelson Kotowski
- Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, FIOCRUZ, Avenida Brasil, 4365, 21040-360, Rio de Janeiro, RJ, Brazil.
| | - Rodrigo Jardim
- Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, FIOCRUZ, Avenida Brasil, 4365, 21040-360, Rio de Janeiro, RJ, Brazil.
| | - Alberto M R Dávila
- Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, FIOCRUZ, Avenida Brasil, 4365, 21040-360, Rio de Janeiro, RJ, Brazil.
| |
Collapse
|
20
|
Behura SK. Insect phylogenomics. INSECT MOLECULAR BIOLOGY 2015; 24:403-11. [PMID: 25963452 PMCID: PMC4503476 DOI: 10.1111/imb.12174] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/16/2014] [Revised: 03/10/2015] [Accepted: 04/04/2015] [Indexed: 05/08/2023]
Abstract
Phylogenomics, the integration of phylogenetics with genome data, has emerged as a powerful approach to study the evolution and systematics of species. Recently, several studies employing phylogenomic tools have provided better insights into insect evolution. Next-generation sequencing methods are now increasingly used by entomologists to generate genomic and transcript sequences of various insect species and strains. These data provide opportunities for comparative genomics and large-scale multigene phylogenies of diverse lineages of insects. Phy-logenomic investigations help us to better understand systematic and evolutionary relationships of insect species that play important roles as herbivores, predators, detritivores, pollinators and disease vectors. It is important that we critically assess the prospects and limitations of phylogenomic methods. In this review, I describe the current status, outline the major challenges and remark on potential future applications of phylogenomic tools in studying insect systematics and evolution.
Collapse
Affiliation(s)
- S K Behura
- Eck Institute for Global Health and Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA
| |
Collapse
|
21
|
Chiara M, Caruso M, D'Erchia AM, Manzari C, Fraccalvieri R, Goffredo E, Latorre L, Miccolupo A, Padalino I, Santagada G, Chiocco D, Pesole G, Horner DS, Parisi A. Comparative Genomics of Listeria Sensu Lato: Genus-Wide Differences in Evolutionary Dynamics and the Progressive Gain of Complex, Potentially Pathogenicity-Related Traits through Lateral Gene Transfer. Genome Biol Evol 2015; 7:2154-72. [PMID: 26185097 PMCID: PMC4558849 DOI: 10.1093/gbe/evv131] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Historically, genome-wide and molecular characterization of the genus Listeria has concentrated on the important human pathogen Listeria monocytogenes and a small number of closely related species, together termed Listeria sensu strictu. More recently, a number of genome sequences for more basal, and nonpathogenic, members of the Listeria genus have become available, facilitating a wider perspective on the evolution of pathogenicity and genome level evolutionary dynamics within the entire genus (termed Listeria sensu lato). Here, we have sequenced the genomes of additional Listeria fleischmannii and Listeria newyorkensis isolates and explored the dynamics of genome evolution in Listeria sensu lato. Our analyses suggest that acquisition of genetic material through gene duplication and divergence as well as through lateral gene transfer (mostly from outside Listeria) is widespread throughout the genus. Novel genetic material is apparently subject to rapid turnover. Multiple lines of evidence point to significant differences in evolutionary dynamics between the most basal Listeria subclade and all other congeners, including both sensu strictu and other sensu lato isolates. Strikingly, these differences are likely attributable to stochastic, population-level processes and contribute to observed variation in genome size across the genus. Notably, our analyses indicate that the common ancestor of Listeria sensu lato lacked flagella, which were acquired by lateral gene transfer by a common ancestor of Listeria grayi and Listeria sensu strictu, whereas a recently functionally characterized pathogenicity island, responsible for the capacity to produce cobalamin and utilize ethanolamine/propane-2-diol, was acquired in an ancestor of Listeria sensu strictu.
Collapse
Affiliation(s)
- Matteo Chiara
- Dipartimento di Bioscienze, Università degli Studi di Milano, Italy Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Marta Caruso
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Anna Maria D'Erchia
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari Aldo Moro, Italy Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, Italy
| | - Caterina Manzari
- Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, Italy
| | - Rosa Fraccalvieri
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Elisa Goffredo
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Laura Latorre
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Angela Miccolupo
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Iolanda Padalino
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Gianfranco Santagada
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Doriano Chiocco
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| | - Graziano Pesole
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari Aldo Moro, Italy Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, Italy
| | - David S Horner
- Dipartimento di Bioscienze, Università degli Studi di Milano, Italy
| | - Antonio Parisi
- Istituto Zooprofilattico Sperimentale della Puglia e della Basilicata, Foggia, Italy
| |
Collapse
|
22
|
Arenas M. Advances in computer simulation of genome evolution: toward more realistic evolutionary genomics analysis by approximate bayesian computation. J Mol Evol 2015; 80:189-92. [PMID: 25808249 DOI: 10.1007/s00239-015-9673-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2015] [Accepted: 03/19/2015] [Indexed: 11/29/2022]
Abstract
NGS technologies present a fast and cheap generation of genomic data. Nevertheless, ancestral genome inference is not so straightforward due to complex evolutionary processes acting on this material such as inversions, translocations, and other genome rearrangements that, in addition to their implicit complexity, can co-occur and confound ancestral inferences. Recently, models of genome evolution that accommodate such complex genomic events are emerging. This letter explores these novel evolutionary models and proposes their incorporation into robust statistical approaches based on computer simulations, such as approximate Bayesian computation, that may produce a more realistic evolutionary analysis of genomic data. Advantages and pitfalls in using these analytical methods are discussed. Potential applications of these ancestral genomic inferences are also pointed out.
Collapse
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Universidad Autónoma de Madrid (CSIC-UAM), C/Nicolás Cabrera, 1, Cantoblanco, 28049, Madrid, Spain,
| |
Collapse
|
23
|
Abstract
Phylogenomics heavily relies on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologos. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.
Collapse
|
24
|
Trachana K, Forslund K, Larsson T, Powell S, Doerks T, von Mering C, Bork P. A phylogeny-based benchmarking test for orthology inference reveals the limitations of function-based validation. PLoS One 2014; 9:e111122. [PMID: 25369365 PMCID: PMC4219706 DOI: 10.1371/journal.pone.0111122] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2014] [Accepted: 09/23/2014] [Indexed: 11/19/2022] Open
Abstract
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, functional annotation between orthologs serves as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, while it is often not applicable due to limited experimental evidence for most species. Therefore, we constructed high quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments, misjudging the true performance of orthology inference methods. We also examined how our dataset can instruct the selection of a “core” species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.
Collapse
Affiliation(s)
- Kalliopi Trachana
- Institute for Systems Biology, Seattle, WA, United States of America
| | - Kristoffer Forslund
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Tomas Larsson
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Sean Powell
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Tobias Doerks
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Christian von Mering
- Institute of Molecular Life Sciences, University of Zurich and Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Max-Delbruck-Centre for Molecular Medicine, Berlin, Germany
- * E-mail:
| |
Collapse
|
25
|
Sonnhammer ELL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics 2014; 30:2993-8. [PMID: 25064571 PMCID: PMC4201156 DOI: 10.1093/bioinformatics/btu492] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2014] [Revised: 06/25/2014] [Accepted: 07/16/2014] [Indexed: 01/29/2023] Open
Abstract
UNLABELLED Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking. AVAILABILITY AND IMPLEMENTATION All such materials are available at http://questfororthologs.org.
Collapse
Affiliation(s)
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Toni Gabaldón
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Alan W Sousa da Silva
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Maria Martin
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Marc Robinson-Rechavi
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Brigitte Boeckmann
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Paul D Thomas
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Christophe Dessimoz
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| |
Collapse
|
26
|
Pereira C, Denise A, Lespinet O. A meta-approach for improving the prediction and the functional annotation of ortholog groups. BMC Genomics 2014; 15 Suppl 6:S16. [PMID: 25573073 PMCID: PMC4240552 DOI: 10.1186/1471-2164-15-s6-s16] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In comparative genomics, orthologs are used to transfer annotation from genes already characterized to newly sequenced genomes. Many methods have been developed for finding orthologs in sets of genomes. However, the application of different methods on the same proteome set can lead to distinct orthology predictions. METHODS We developed a method based on a meta-approach that is able to combine the results of several methods for orthologous group prediction. The purpose of this method is to produce better quality results by using the overlapping results obtained from several individual orthologous gene prediction procedures. Our method proceeds in two steps. The first aims to construct seeds for groups of orthologous genes; these seeds correspond to the exact overlaps between the results of all or several methods. In the second step, these seed groups are expanded by using HMM profiles. RESULTS We evaluated our method on two standard reference benchmarks, OrthoBench and Orthology Benchmark Service. Our method presents a higher level of accurately predicted groups than the individual input methods of orthologous group prediction. Moreover, our method increases the number of annotated orthologous pairs without decreasing the annotation quality compared to twelve state-of-the-art methods. CONCLUSIONS The meta-approach based method appears to be a reliable procedure for predicting orthologous groups. Since a large number of methods for predicting groups of orthologous genes exist, it is quite conceivable to apply this meta-approach to several combinations of different methods.
Collapse
|
27
|
Linard B, Allot A, Schneider R, Morel C, Ripp R, Bigler M, Thompson JD, Poch O, Lecompte O. OrthoInspector 2.0: Software and database updates. ACTA ACUST UNITED AC 2014; 31:447-8. [PMID: 25273105 DOI: 10.1093/bioinformatics/btu642] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
SUMMARY We previously developed OrthoInspector, a package incorporating an original algorithm for the detection of orthology and inparalogy relations between different species. We have added new functionalities to the package. While its original algorithm was not modified, performing similar orthology predictions, we facilitated the prediction of very large databases (thousands of proteomes), refurbished its graphical interface, added new visualization tools for comparative genomics/protein family analysis and facilitated its deployment in a network environment. Finally, we have released three online databases of precomputed orthology relationships. AVAILABILITY Package and databases are freely available at http://lbgi.fr/orthoinspector with all major browsers supported. CONTACT odile.lecompte@unistra.fr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Benjamin Linard
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Alexis Allot
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Raphaël Schneider
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Can Morel
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Raymond Ripp
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Marc Bigler
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Julie D Thompson
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Olivier Poch
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| | - Odile Lecompte
- LBGI, Computer Science Department, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, 4 rue Kirschleger 67085 Strasbourg, France and Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD
| |
Collapse
|
28
|
Khenoussi W, Vanhoutrève R, Poch O, Thompson JD. SIBIS: a Bayesian model for inconsistent protein sequence estimation. Bioinformatics 2014; 30:2432-9. [PMID: 24825613 DOI: 10.1093/bioinformatics/btu329] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION Source code, implemented in C on a linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/∼julie/SIBIS.
Collapse
Affiliation(s)
- Walyd Khenoussi
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
| | - Renaud Vanhoutrève
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
| | - Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
| | - Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
| |
Collapse
|
29
|
Dalquen DA, Dessimoz C. Bidirectional best hits miss many orthologs in duplication-rich clades such as plants and animals. Genome Biol Evol 2014; 5:1800-6. [PMID: 24013106 PMCID: PMC3814191 DOI: 10.1093/gbe/evt132] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Bidirectional best hits (BBH), which entails identifying the pairs of genes in two different genomes that are more similar to each other than either is to any other gene in the other genome, is a simple and widely used method to infer orthology. A recent study has analyzed the link between BBH and orthology in bacteria and archaea and concluded that, given the very high consistency in BBH they observed among triplets of neighboring genes, a high proportion of BBH are likely to be bona fide orthologs. However, limited by their analysis setup, the previous study could not easily test the reverse question: which proportion of orthologs are BBH? In this follow-up study, we consider this question in theory and answer it based on conceptual arguments, simulated data, and real biological data from all three domains of life. Our analyses corroborate the findings of the previous study, but also show that because of the high rate of gene duplication in plants and animals, as much as 60% of orthologous relations are missed by the BBH criterion.
Collapse
Affiliation(s)
- Daniel A Dalquen
- Computational Biochemistry Research Group, ETH Zurich, Zürich, Switzerland
| | | |
Collapse
|
30
|
Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, Gabaldón T, Rattei T, Creevey C, Kuhn M, Jensen LJ, von Mering C, Bork P. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 2013; 42:D231-9. [PMID: 24297252 PMCID: PMC3964997 DOI: 10.1093/nar/gkt1253] [Citation(s) in RCA: 422] [Impact Index Per Article: 38.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
Collapse
Affiliation(s)
- Sean Powell
- European Molecular Biology Laboratory, Computational Biology Unit, Meyerhofstrasse 1, 69117 Heidelberg, Germany, University of Zurich and Swiss Institute of Bioinformatics, Institute of Molecular Life Sciences, Winterthurerstrasse 190, 8057 Zurich, Switzerland, Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109-5234, USA, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), C/Dr. Aiguader 88, 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, CUBE-Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, Althanstraße 14, 1090 Vienna, Austria, Institute of Biological, Environmental & Rural Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, SY23 3FG, UK, Biotechnology Center, TU Dresden, 01062 Dresden, Germany, Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, 2200, Copenhagen N, Denmark and Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany
| | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|