51
|
The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. Methods Mol Biol 2016; 1415:477-506. [PMID: 27115649 DOI: 10.1007/978-1-4939-3572-7_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 03/02/2023]
|
52
|
Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015; 10:e0141287. [PMID: 26555596 PMCID: PMC4640716 DOI: 10.1371/journal.pone.0141287] [Citation(s) in RCA: 370] [Impact Index Per Article: 37.0] [Received: 07/09/2015] [Accepted: 10/05/2015] [Indexed: 12/22/2022]
Abstract
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
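As a rough illustration of the idea of turning a sequence into one dense vector, the sketch below splits a protein into overlapping k-mers and sums a vector for each k-mer. The choice of k = 3, the summation, the function names `kmers` and `embed`, and the two-dimensional `toy_table` are all assumptions made for illustration; the abstract does not specify these details, and in practice the lookup table would come from a trained model.

```python
from typing import Dict, List


def kmers(seq: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of a sequence (k = 3 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed(seq: str, table: Dict[str, List[float]], k: int = 3) -> List[float]:
    """Sum the vectors of every k-mer that appears in the lookup table.

    `table` is a hypothetical k-mer -> vector mapping; k-mers missing
    from it are skipped.
    """
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for word in kmers(seq, k):
        vec = table.get(word)
        if vec is None:
            continue
        for j, x in enumerate(vec):
            total[j] += x
    return total


# Toy two-dimensional vectors, invented purely for this example.
toy_table = {"MKV": [1.0, 0.0], "KVL": [0.0, 2.0]}
protein_vector = embed("MKVL", toy_table)
```

The resulting fixed-length vector could then be passed to a standard classifier such as a support vector machine, which is the kind of downstream use the abstract describes.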
Affiliation(s)
- Ehsaneddin Asgari
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America
- Mohammad R. K. Mofrad
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America
- Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America
|
53
|
Cao R, Cheng J. Deciphering the association between gene function and spatial gene-gene interactions in 3D human genome conformation. BMC Genomics 2015; 16:880. [PMID: 26511362 PMCID: PMC4625479 DOI: 10.1186/s12864-015-2093-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/14/2015] [Accepted: 10/15/2015] [Indexed: 01/17/2023]
Abstract
BACKGROUND A number of factors have been investigated in the context of gene function prediction and analysis, such as sequence identity, gene expression, and gene co-evolution. However, the three-dimensional (3D) conformation of the genome has not been tapped to analyse gene function, probably largely because genome conformation data were lacking until recently.
METHODS We construct genome-wide spatial gene-gene interaction networks for three different human B-cells or cell lines from their chromosomal contact data generated by the Hi-C chromosome conformation capture technique. G-SESAME and Fast-SemSim are used to calculate function similarity between interacting and non-interacting genes. The Gene Ontology statistics computed from the gene-gene interaction networks are used for gene function prediction.
RESULTS We compare the function similarity of gene pairs that spatially interact with that of pairs that do not. We find that genes with strong spatial interactions tend to have highly similar function in terms of the biological process, molecular function and cellular component categories of the Gene Ontology. Even though the level of gene-gene interaction generally has no or only weak correlation with either sequential genomic distance or sequence identity between genes, interacting genes with high function similarity tend to have stronger interactions, somewhat shorter genomic distance and significantly higher sequence identity. Combining genomic distance or sequence identity with spatial gene-gene interaction information informs gene-gene function similarity much better than using either one alone, suggesting that gene-gene interaction information is largely complementary to genomic distance and sequence identity in the context of gene function analysis. We develop and evaluate a new gene function prediction method based on gene-gene interaction networks, which can predict gene function well for a large number of human genes.
CONCLUSIONS In this work, we demonstrate that the spatial conformation of the human genome is relevant to gene function similarity and is useful for gene function prediction.
Affiliation(s)
- Renzhi Cao
- Computer Science Department, University of Missouri, Columbia, Missouri, 65211, USA.
- Jianlin Cheng
- Computer Science Department, University of Missouri, Columbia, Missouri, 65211, USA.
- Informatics Institute, University of Missouri, Columbia, Missouri, 65211, USA.
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, Missouri, 65211, USA.
|
54
|
Sherman WA, Kuchibhatla DB, Limviphuvadh V, Maurer-Stroh S, Eisenhaber B, Eisenhaber F. HPMV: human protein mutation viewer - relating sequence mutations to protein sequence architecture and function changes. J Bioinform Comput Biol 2015; 13:1550028. [PMID: 26503432 DOI: 10.1142/s0219720015500286] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Next-generation sequencing advances are rapidly expanding the number of human mutations to be analyzed for causative roles in genetic disorders. Our Human Protein Mutation Viewer (HPMV) is intended to explore the biomolecular mechanistic significance of non-synonymous human mutations in protein-coding genomic regions. The tool helps to assess whether protein mutations affect the occurrence of sequence-architectural features (globular domains, targeting signals, post-translational modification sites, etc.). As input, HPMV accepts protein mutations - as UniProt accessions with mutations (e.g. HGVS nomenclature), genome coordinates, or FASTA sequences. As output, HPMV provides an interactive cartoon showing the mutations in relation to elements of the sequence architecture. A large variety of protein sequence architectural features were selected for their particular relevance to mutation interpretation. Clicking a sequence feature in the cartoon expands a tree view of additional information including multiple sequence alignments of conserved domains and a simple 3D viewer mapping the mutation to known PDB structures, if available. The cartoon is also correlated with a multiple sequence alignment of similar sequences from other organisms. In cases where a mutation is likely to have a straightforward interpretation (e.g. a point mutation disrupting a well-understood targeting signal), this interpretation is suggested. The interactive cartoon can be downloaded as standalone viewer in Java jar format to be saved and viewed later with only a standard Java runtime environment. The HPMV website is: http://hpmv.bii.a-star.edu.sg/ .
Affiliation(s)
- Westley Arthur Sherman
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Durga Bhavani Kuchibhatla
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Vachiranee Limviphuvadh
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Sebastian Maurer-Stroh
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Singapore
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive 4, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
|
55
|
Wong WC, Yap CK, Eisenhaber B, Eisenhaber F. dissectHMMER: a HMMER-based score dissection framework that statistically evaluates fold-critical sequence segments for domain fold similarity. Biol Direct 2015; 10:39. [PMID: 26228544 PMCID: PMC4521371 DOI: 10.1186/s13062-015-0068-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 03/26/2015] [Accepted: 07/20/2015] [Indexed: 11/10/2022]
Abstract
Background Annotation transfer for function and structure within the sequence homology concept essentially requires protein sequence similarity for the secondary structural blocks forming the fold of a protein. A simplistic similarity approach in the case of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc.) is not justified and is a pertinent source of mistaken homologies. The latter is due either to positional sequence conservation as a result of a very simple, physically induced pattern or to integral sequence properties that are critical for function. Furthermore, because the number of well-studied proteins continues to grow only slowly, a search methodology is needed that dives deeper into the sequence similarity space to connect unknown sequences to well-studied, albeit more distant, ones for biological function postulation.
Results Based on our previous work on dissecting the hidden Markov model (HMMER) based similarity score into fold-critical and non-globular contributions to improve homology inference, we propose a framework, dissectHMMER, that identifies more fold-related domain hits from standard HMMER searches. Subsequent statistical stratification of the fold-related hits into cohorts of functionally related domains allows function postulation for the query sequence. Briefly, this work addresses the technical problems of how to recognize non-globular parts in the domain model, resolve contradictory HMMER2/HMMER3 results and evaluate fold-related domain hits for homology. The framework is benchmarked against a set of SCOP-to-Pfam domain models. Despite being a sequence-to-profile method, dissectHMMER performs favorably against a profile-to-profile method, HHsuite/HHsearch. Examples of function annotation using dissectHMMER, including the function discovery of an uncharacterized membrane protein Q9K8K1_BACHD (WP_010899149.1) as a lactose/H+ symporter, are presented. Finally, the dissectHMMER webserver is made publicly available at http://dissecthmmer.bii.a-star.edu.sg.
Conclusions The proposed framework, dissectHMMER, is faithful to the original inception of the sequence homology concept while improving upon the existing HMMER search tool through the rescue of statistically evaluated false-negative yet fold-related domain hits to the query sequence. Overall, this translates into an opportunity for any novel protein sequence to be functionally characterized.
Reviewers This article was reviewed by Masanori Arita, Shamil Sunyaev and L. Aravind.
Electronic supplementary material The online version of this article (doi:10.1186/s13062-015-0068-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.
- Choon-Kong Yap
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore, 117597, Singapore.
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore, 637553, Singapore.
|
56
|
Schlacht A, Dacks JB. Unexpected ancient paralogs and an evolutionary model for the COPII coat complex. Genome Biol Evol 2015; 7:1098-109. [PMID: 25747251 PMCID: PMC4419792 DOI: 10.1093/gbe/evv045] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 01/01/2023]
Abstract
The coat protein complex II (COPII) is responsible for the transport of protein cargoes from the Endoplasmic Reticulum (ER) to the Golgi apparatus. COPII has been functionally characterized extensively in vivo in humans and yeast. This complex shares components with the nuclear pore complex and the Seh1-Associated (SEA) complex, inextricably linking its evolution with that of the nuclear pore and other protocoatomer domain-containing complexes. Importantly, this is one of the last coat complexes to be examined from a comparative genomic and phylogenetic perspective. We use homology searching of eight components across 74 eukaryotic genomes, followed by phylogenetic analyses, to assess both the distribution of the COPII components across eukaryote diversity and to assess its evolutionary history. We report that Sec12, but not Sed4 was present in the Last Eukaryotic Common Ancestor along with Sec16, Sar1, Sec13, Sec31, Sec23, and Sec24. We identify a previously undetected paralog of Sec23 that, at least, predates the archaeplastid clade. We also describe three Sec24 paralogs likely present in the Last Eukaryotic Common Ancestor, including one newly detected that was anciently present but lost from both opisthokonts and excavates. Altogether, we report previously undescribed complexity of the COPII coat in the ancient eukaryotic ancestor and speculate on models for the evolution, not only of the complex, but its relationship to other protocoatomer-derived complexes.
Affiliation(s)
- Alexander Schlacht
- Department of Cell Biology, University of Alberta, Edmonton, Alberta, Canada
- Joel B Dacks
- Department of Cell Biology, University of Alberta, Edmonton, Alberta, Canada
|
57
|
Callahan A, Cifuentes JJ, Dumontier M. An evidence-based approach to identify aging-related genes in Caenorhabditis elegans. BMC Bioinformatics 2015; 16:40. [PMID: 25888240 PMCID: PMC4339751 DOI: 10.1186/s12859-015-0469-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 05/22/2014] [Accepted: 01/15/2015] [Indexed: 12/21/2022]
Abstract
Background Extensive studies have been carried out on Caenorhabditis elegans as a model organism to elucidate mechanisms of aging and the effects of perturbing known aging-related genes on lifespan and behavior. This research has generated large amounts of experimental data that is increasingly difficult to integrate and analyze with existing databases and domain knowledge. To address this challenge, we demonstrate a scalable and effective approach for automatic evidence gathering and evaluation that leverages existing experimental data and literature-curated facts to identify genes involved in aging and lifespan regulation in C. elegans.
Results We developed a semantic knowledge base for aging by integrating data about C. elegans genes from WormBase with data about 2005 human and model organism genes from GenAge and 149 genes from GenDR, and with the Bio2RDF network of linked data for the life sciences. Using HyQue (a Semantic Web tool for hypothesis-based querying and evaluation) to interrogate this knowledge base, we examined 48,231 C. elegans genes for their role in modulating lifespan and aging. HyQue identified 24 novel but well-supported candidate aging-related genes for further experimental validation.
Conclusions We use semantic technologies to discover candidate aging genes whose effects on lifespan are not yet well understood. Our customized HyQue system, the aging research knowledge base it operates over, and HyQue evaluations of all C. elegans genes are freely available at http://hyque.semanticscience.org.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0469-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alison Callahan
- Stanford Center for Biomedical Informatics Research, School of Medicine, Stanford University, Stanford, California, USA.
- Juan José Cifuentes
- Molecular Bioinformatics Laboratory, Millennium Institute on Immunology and Immunotherapy, Portugal 49, Santiago, CP 8330025, Chile.
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile.
- Michel Dumontier
- Stanford Center for Biomedical Informatics Research, School of Medicine, Stanford University, Stanford, California, USA.
|
58
|
Uddin R, Saeed K, Khan W, Azam SS, Wadood A. Metabolic pathway analysis approach: Identification of novel therapeutic target against methicillin resistant Staphylococcus aureus. Gene 2015; 556:213-26. [DOI: 10.1016/j.gene.2014.11.056] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 08/13/2014] [Revised: 11/18/2014] [Accepted: 11/25/2014] [Indexed: 12/31/2022]
|
59
|
Jeffares DC, Tomiczek B, Sojo V, dos Reis M. A beginners guide to estimating the non-synonymous to synonymous rate ratio of all protein-coding genes in a genome. Methods Mol Biol 2015; 1201:65-90. [PMID: 25388108 DOI: 10.1007/978-1-4939-1438-8_4] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Indexed: 06/04/2023]
Abstract
The ratio of non-synonymous to synonymous substitutions (dN/dS) is a useful measure of the strength and mode of natural selection acting on protein-coding genes. It is widely used to study patterns of selection on protein genes on a genomic scale, from the small genomes of viruses, bacteria, and parasitic eukaryotes to the largest eukaryotic genomes. In this chapter we describe all the steps necessary to calculate the dN/dS of all the genes using at least two genomes. We include a brief discussion on assigning orthologs, and of codon-aware alignment of orthologs. We then describe how to use the CODEML program of the PAML package for phylogenetic analysis to calculate the dN/dS and how to perform some statistical tests for positive selection. We then outline some methods for interpreting output and describe how one may use this data to make discoveries about the biology of your species. Finally, as a worked example we show all the steps we used to calculate dN/dS for 3,261 orthologs from six Plasmodium species, including tests for adaptive evolution (see worked_example.pdf).
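The distinction between synonymous and non-synonymous changes that underlies dN/dS can be illustrated with a small sketch. The code below only counts codon positions in a pairwise, gap-free, in-frame alignment where the two codons differ, and labels each difference by whether the encoded amino acid changes under the standard genetic code. It is not an estimate of dN/dS: it does not normalize by the number of synonymous and non-synonymous sites or correct for multiple substitutions, both of which a program such as CODEML handles. The function name `count_differences` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a: str, seq_b: str) -> tuple:
    """Return (synonymous, non_synonymous) counts of differing codons.

    Assumes two aligned, in-frame, upper-case DNA sequences without gaps
    or ambiguity codes.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1       # same amino acid: synonymous difference
        else:
            nonsyn += 1    # different amino acid: non-synonymous difference
    return syn, nonsyn


# TTT -> TTC keeps Phe (synonymous); GGA -> GCA changes Gly to Ala.
result = count_differences("ATGTTTGGA", "ATGTTCGCA")
```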
Affiliation(s)
- Daniel C Jeffares
- Research Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, UK
|
60
|
Junier I. Conserved patterns in bacterial genomes: a conundrum physically tailored by evolutionary tinkering. Comput Biol Chem 2014; 53 Pt A:125-33. [PMID: 25239779 DOI: 10.1016/j.compbiolchem.2014.08.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Accepted: 07/11/2014] [Indexed: 11/17/2022]
Abstract
The proper functioning of bacteria is encoded in their genome at multiple levels or scales, each of which is constrained by specific physical forces. At the smallest spatial scales, interatomic forces dictate the folding and function of proteins and nucleic acids. On longer length scales, stochastic forces emerging from the thermal jiggling of proteins and RNAs impose strong constraints on the organization of genes along chromosomes, more particularly in the context of the building of nucleoprotein complexes and the operational mode of regulatory agents. At the cellular level, transcription, replication and cell division activities generate forces that act on both the internal structure and cellular location of chromosomes. The overall result is a complex multi-scale organization of genomes that reflects the evolutionary tinkering of bacteria. The goal of this review is to highlight avenues for deciphering this complexity by focusing on patterns that are conserved among evolutionarily distant bacteria. To this end, I discuss three different organizational scales: the protein structures, the chromosomal organization of genes and the global structure of chromosomes.
Affiliation(s)
- Ivan Junier
- Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain.
|
61
|
Ward N, Moreno-Hagelsieb G. Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss? PLoS One 2014; 9:e101850. [PMID: 25013894 PMCID: PMC4094424 DOI: 10.1371/journal.pone.0101850] [Citation(s) in RCA: 107] [Impact Index Per Article: 9.7] [Received: 04/12/2014] [Accepted: 06/11/2014] [Indexed: 11/30/2022]
Abstract
Reciprocal Best Hits (RBH) are a common proxy for orthology in comparative genomics. Essentially, an RBH is found when the proteins encoded by two genes, each in a different genome, find each other as the best scoring match in the other genome. NCBI's BLAST is the software most commonly used for the sequence comparisons necessary for finding RBHs. Since sequence comparison can be time consuming, we decided to compare the number and quality of RBHs detected using algorithms that run in a fraction of the time required by BLAST. We tested BLAT, LAST and UBLAST. All three programs ran in a hundredth to a 25th of the time required to run BLAST. A reduction in the number of homologs and RBHs found by the faster algorithms compared to BLAST becomes apparent as the genomes compared become more dissimilar, with BLAT, a program optimized for quickly finding very similar sequences, missing both the most homologs and the most RBHs. Though LAST produced the closest number of homologs and RBHs to those produced with BLAST, UBLAST was very close, with either program producing between 0.6 and 0.8 of the RBHs found by BLAST between dissimilar genomes, while in more similar genomes the differences were barely apparent. UBLAST ran faster than LAST, making it the best option among the programs tested.
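The reciprocal-best-hit definition given in the abstract can be sketched in a few lines. The example assumes that hits from each search direction have already been parsed into `(query, subject, score)` tuples; the function names, the tuple layout and the toy identifiers are assumptions for illustration and do not correspond to the output format of BLAST, BLAT, LAST or UBLAST. Ties are resolved in favor of the first hit seen, which is a simplification.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples; on equal
    scores the first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy data: a1 and b1 pick each other; a2 also picks b1 but is not picked back.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 50), ("a2", "b1", 120)]
b_vs_a = [("b1", "a1", 198), ("b2", "a2", 40)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```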
Affiliation(s)
- Natalie Ward
- Department of Biology, Wilfrid Laurier University, Waterloo, Ontario, Canada
|
62
|
Wong WC, Maurer-Stroh S, Eisenhaber B, Eisenhaber F. On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation. BMC Bioinformatics 2014; 15:166. [PMID: 24890864 PMCID: PMC4061105 DOI: 10.1186/1471-2105-15-166] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/25/2013] [Accepted: 05/27/2014] [Indexed: 02/01/2023]
Abstract
Background Protein sequence similarities to any types of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc. where either positional sequence conservation is the result of a very simple, physically induced pattern or rather integral sequence properties are critical) are pertinent sources for mistaken homologies. Regretfully, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute to manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. Results The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks for conferring the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept for cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied to a large-scale study of SMART and PFAM domains in the space of seed sequences and in the space of UniProt/SwissProt. 
Conclusions Sequence similarity core dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only supported by structure comparison.
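The idea of splitting a total similarity score into a fold-critical part and a remainder can be sketched as follows. The sketch assumes the total score is a plain sum of per-position contributions and that the fold-critical alignment positions are already known; both are simplifications, since real HMMER scores also include terms such as state transitions, and the function name `dissect_score` and the toy numbers are hypothetical rather than taken from the DissectHMMER software.

```python
def dissect_score(position_scores, fold_critical):
    """Split a summed score into (fold_critical_part, remaining_part).

    `position_scores` holds one score contribution per alignment position;
    `fold_critical` is the set of 0-based positions judged fold-critical.
    """
    fold_part = sum(
        score for i, score in enumerate(position_scores) if i in fold_critical
    )
    return fold_part, sum(position_scores) - fold_part


# Toy example: positions 0 and 1 are treated as fold-critical.
fold_part, other_part = dissect_score([2, 3, -1, 4], {0, 1})
```

Under the criterion described in the abstract, a homology statement would then rest on whether the fold-critical part is significant on its own, rather than on the total.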
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore.
|
63
|
Szövényi P, Devos N, Weston DJ, Yang X, Hock Z, Shaw JA, Shimizu KK, McDaniel SF, Wagner A. Efficient purging of deleterious mutations in plants with haploid selfing. Genome Biol Evol 2014; 6:1238-52. [PMID: 24879432 PMCID: PMC4041004 DOI: 10.1093/gbe/evu099] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 01/30/2023]
Abstract
In diploid organisms, selfing reduces the efficiency of selection in removing deleterious mutations from a population. This need not be the case for all organisms. Some plants, for example, undergo an extreme form of selfing known as intragametophytic selfing, which immediately exposes all recessive deleterious mutations in a parental genome to selective purging. Here, we ask how effectively deleterious mutations are removed from such plants. Specifically, we study the extent to which deleterious mutations accumulate in a predominantly selfing and a predominantly outcrossing pair of moss species, using genome-wide transcriptome data. We find that the selfing species purge significantly more nonsynonymous mutations, as well as a greater proportion of radical amino acid changes which alter physicochemical properties of amino acids. Moreover, their purging of deleterious mutation is especially strong in conserved regions of protein-coding genes. Our observations show that selfing need not impede but can even accelerate the removal of deleterious mutations, and do so on a genome-wide scale.
Affiliation(s)
- Péter Szövényi
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
- Institute of Systematic Botany, University of Zurich, Switzerland
- Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland
- MTA-ELTE-MTM Ecology Research Group, ELTE, Biological Institute, Hungary
- David J Weston
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN
- Xiaohan Yang
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN
- Zsófia Hock
- Institute of Systematic Botany, University of Zurich, Switzerland
- Kentaro K Shimizu
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
- Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
- Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), Singapore
- The Santa Fe Institute, Santa Fe NM
|
64
|
Eisenhaber B, Eisenhaber S, Kwang TY, Grüber G, Eisenhaber F. Transamidase subunit GAA1/GPAA1 is a M28 family metallo-peptide-synthetase that catalyzes the peptide bond formation between the substrate protein's omega-site and the GPI lipid anchor's phosphoethanolamine. Cell Cycle 2014; 13:1912-7. [PMID: 24743167 PMCID: PMC4111754 DOI: 10.4161/cc.28761] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
Abstract
The transamidase subunit GAA1/GPAA1 is predicted to be the enzyme that catalyzes the attachment of the glycosylphosphatidylinositol (GPI) lipid anchor to the carbonyl intermediate of the substrate protein at the ω-site. Its ~300-amino acid residue lumenal domain is a M28 family metallo-peptide-synthetase with an α/β hydrolase fold, including a central 8-strand β-sheet and a single metal (most likely zinc) ion coordinated by 3 conserved polar residues. Phosphoethanolamine is used as an adaptor to make the non-peptide GPI lipid anchor look chemically similar to the N terminus of a peptide.
Affiliation(s)
- Birgit Eisenhaber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore
- Stephan Eisenhaber
- Department of Physical Chemistry; University of Vienna; Wien/Vienna, Republic of Austria
- Toh Yew Kwang
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore
- Gerhard Grüber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore; Nanyang Technological University; School of Biological Sciences; Singapore, Republic of Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore; Department of Biological Sciences (DBS); National University of Singapore (NUS); Singapore, Republic of Singapore; School of Computer Engineering (SCE); Nanyang Technological University (NTU); Singapore, Republic of Singapore
|
65
|
Puggioni V, Dondi A, Folli C, Shin I, Rhee S, Percudani R. Gene Context Analysis Reveals Functional Divergence between Hypothetically Equivalent Enzymes of the Purine–Ureide Pathway. Biochemistry 2014; 53:735-45. [DOI: 10.1021/bi4010107] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Affiliation(s)
- Vincenzo Puggioni
- Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Ambra Dondi
- Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Claudia Folli
- Department of Food Science, University of Parma, Italy
- Inchul Shin
- Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Sangkee Rhee
- Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Riccardo Percudani
- Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
|
66
|
Cao L, Chen F, Yang X, Xu W, Xie J, Yu L. Phylogenetic analysis of CDK and cyclin proteins in premetazoan lineages. BMC Evol Biol 2014; 14:10. [PMID: 24433236 PMCID: PMC3923393 DOI: 10.1186/1471-2148-14-10] [Citation(s) in RCA: 96] [Impact Index Per Article: 8.7] [Received: 07/03/2013] [Accepted: 01/02/2014] [Indexed: 12/21/2022] Open
Abstract
Background: The molecular history of animal evolution from single-celled ancestors remains a major question in biology, and little is known regarding the evolution of cell cycle regulation during animal emergence. In this study, we conducted a comprehensive evolutionary analysis of CDK and cyclin proteins in metazoans and their unicellular relatives. Results: Our analysis divided the CDK family into eight subfamilies. Seven subfamilies (CDK1/2/3, CDK5, CDK7, CDK20, CDK8/19, CDK9, and CDK10/11) are conserved in metazoans and fungi, with the remaining subfamily, CDK4/6, found only in eumetazoans. With respect to cyclins, the cyclin C, H, L, and Y subfamilies, and cyclin K and T as a whole subfamily, are generally conserved in animals, fungi, and the amoeba Dictyostelium discoideum. In contrast, cyclin subfamilies B, A, E, and D, which are cell cycle-related, have distinct evolutionary histories. The cyclin B subfamily is generally conserved in D. discoideum, fungi, and animals, whereas the cyclin A and E subfamilies are both present in animals and their unicellular relatives such as the choanoflagellate Monosiga brevicollis and the filasterean Capsaspora owczarzaki, but are absent in fungi and D. discoideum. Although absent in fungi and D. discoideum, cyclin D subfamily orthologs can be found in the early-emerging, non-opisthokont apusozoan Thecamonas trahens. Within opisthokonta, the cyclin D subfamily is conserved only in eumetazoans, and is absent in fungi, choanoflagellates, and the basal metazoan Amphimedon queenslandica. Conclusions: Our data indicate that the CDK4/6 subfamily and eumetazoans emerged simultaneously, with the evolutionary conservation of the cyclin D subfamily also tightly linked with eumetazoan appearance. Establishment of the CDK4/6-cyclin D complex may have been the key step in the evolution of cell cycle control during eumetazoan emergence.
Affiliation(s)
- Lihuan Cao
- State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Shanghai 200433, PR China.
|
67
|
Barad S, Horowitz SB, Kobiler I, Sherman A, Prusky D. Accumulation of the mycotoxin patulin in the presence of gluconic acid contributes to pathogenicity of Penicillium expansum. MOLECULAR PLANT-MICROBE INTERACTIONS : MPMI 2014; 27:66-77. [PMID: 24024763 DOI: 10.1094/mpmi-05-13-0138-r] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Indexed: 05/12/2023]
Abstract
Penicillium expansum, the causal agent of blue mold rot, causes severe postharvest fruit maceration through secretion of D-gluconic acid (GLA) and secondary metabolites such as the mycotoxin patulin in colonized tissue. GLA involvement in pathogenicity has been suggested but the mechanism of patulin accumulation and its contribution to P. expansum pathogenicity remain unclear. The roles of GLA and patulin accumulation in P. expansum pathogenicity were studied using i) glucose oxidase GOX2-RNAi mutants exhibiting decreased GOX2 expression, GLA accumulation, and reduced pathogenicity; ii) IDH-RNAi mutants exhibiting downregulation of IDH (the last gene in patulin biosynthesis), reduced patulin accumulation, and no effect on GLA level; and iii) PACC-RNAi mutants exhibiting downregulation of both GOX2 and IDH that reduced GLA and patulin production. Present results indicate that conditions enhancing the decrease in GLA accumulation by GOX2-RNAi and PACC-RNAi mutants, and not low pH, affected patulin accumulation, suggesting GLA production as the driving force for further patulin accumulation. Thus, it is suggested that GLA accumulation may modulate patulin synthesis as a direct precursor under dynamic pH conditions modulating the activation of the transcription factor PACC and the consequent pathogenicity factors, which contribute to host-tissue colonization by P. expansum.
|
68
|
Qin H, Driks A. Contrasting evolutionary patterns of spore coat proteins in two Bacillus species groups are linked to a difference in cellular structure. BMC Evol Biol 2013; 13:261. [PMID: 24283940 PMCID: PMC4219348 DOI: 10.1186/1471-2148-13-261] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/15/2013] [Accepted: 11/20/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The Bacillus subtilis-group and the Bacillus cereus-group are two well-studied groups of species in the genus Bacillus. Bacteria in this genus can produce a highly resistant cell type, the spore, which is encased in a complex protective protein shell called the coat. Spores in the B. cereus-group contain an additional outer layer, the exosporium, which encircles the coat. The coat in B. subtilis spores possesses inner and outer layers. The aim of this study is to investigate whether differences in the spore structures influenced the divergence of the coat protein genes during the evolution of these two Bacillus species groups. RESULTS: We designed and implemented a computational framework to compare the evolutionary histories of coat proteins. We curated a list of B. subtilis coat proteins and identified their orthologs in 11 Bacillus species based on phylogenetic congruence. Phylogenetic profiles of these coat proteins show that they can be divided into conserved and labile ones. Coat proteins comprising the B. subtilis inner coat are significantly more conserved than those comprising the outer coat. We then performed genome-wide comparisons of the nonsynonymous/synonymous substitution rate ratio, dN/dS, and found contrasting patterns: coat proteins have significantly higher dN/dS in the B. subtilis-group genomes, but not in the B. cereus-group genomes. We further corroborated this contrast by examining changes of dN/dS within gene trees, and found that some coat protein gene trees have significantly different dN/dS between the B. subtilis-clade and the B. cereus-clade. CONCLUSIONS: Coat proteins in the B. subtilis- and B. cereus-group species are under contrasting selective pressures. We speculate that the absence of the exosporium in the B. subtilis spore coat effectively lifted a structural constraint that has led to relaxed negative selection pressure on the outer coat.
Affiliation(s)
- Hong Qin
- Department of Biology, Spelman College, Atlanta, GA 30314, USA.
|
69
|
Tao YL, Yang DH, Zhang YT, Zhang Y, Wang ZQ, Wang YS, Cai SQ, Liu SL. Cloning, expression, and characterization of the β-glucosidase hydrolyzing secoisolariciresinol diglucoside to secoisolariciresinol from Bacteroides uniformis ZL1. Appl Microbiol Biotechnol 2013; 98:2519-31. [DOI: 10.1007/s00253-013-5111-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 05/16/2013] [Revised: 06/06/2013] [Accepted: 07/03/2013] [Indexed: 12/16/2022]
|
70
|
Anantharaman V, Iyer LM, Aravind L. Ter-dependent stress response systems: novel pathways related to metal sensing, production of a nucleoside-like metabolite, and DNA-processing. MOLECULAR BIOSYSTEMS 2013; 8:3142-65. [PMID: 23044854 DOI: 10.1039/c2mb25239b] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Indexed: 12/19/2022]
Abstract
The mode of action of the bacterial ter cluster and TelA genes, implicated in natural resistance to tellurite and other xenobiotic toxic compounds, pore-forming colicins and several bacteriophages, has remained enigmatic for almost two decades. Using comparative genomics, sequence-profile searches and structural analysis we present evidence that the ter gene products and their functional partners constitute previously underappreciated, chemical stress response and anti-viral defense systems of bacteria. Based on contextual information from conserved gene neighborhoods and domain architectures, we show that the ter gene products and TelA lie at the center of membrane-linked metal recognition complexes with regulatory ramifications encompassing phosphorylation-dependent signal transduction, RNA-dependent regulation, biosynthesis of nucleoside-like metabolites and DNA processing. Our analysis suggests that the multiple metal-binding and non-binding TerD paralogs and TerC are likely to constitute a membrane-associated complex, which might also include TerB and TerY, and feature several, distinct metal-binding sites. Versions of the TerB domain might also bind small molecule ligands and link the TerD paralog-TerC complex to biosynthetic modules comprising phosphoribosyltransferases (PRTases), ATP grasp amidoligases, TIM-barrel carbon-carbon lyases, and HAD phosphoesterases, which are predicted to synthesize novel nucleoside-like molecules. One of the PRTases is also likely to interact with RNA by means of its Pelota/Ribosomal protein L7AE-like domain. The von Willebrand factor A domain protein, TerY, is predicted to be part of a distinct phosphorylation switch, coupling a protein kinase and a PP2C phosphatase. 
We show, based on the evidence from numerous conserved gene neighborhoods and domain architectures, that both the TerB and TelA domains have been linked to diverse lipid-interaction domains, such as two novel PH-like and the Coq4 domains, in different bacteria, and are likely to comprise membrane-associated sensory complexes that might additionally contain periplasmic binding-protein-II and OmpA domains. We also show that the TerD and TerB domains and the TerY-associated phosphorylation system are functionally linked to many distinct DNA-processing complexes, which feature proteins with SWI2/SNF2 and RecQ-like helicases, multiple AAA+ ATPases, McrC-N-terminal domain proteins, several restriction endonuclease fold DNases, DNA-binding domains and a type-VII/Esx-like system, which is at the center of a predicted DNA transfer apparatus. These DNA-processing modules and associated genes are predicted to be involved in restriction or suicidal action in response to phages and possibly repairing xenobiotic-induced DNA damage. In some eukaryotes, certain components of the ter system appear to be recruited to function in conjunction with the ubiquitin system and calcium-signaling pathways.
Affiliation(s)
- Vivek Anantharaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
71
|
Dutilh BE, Backus L, Edwards RA, Wels M, Bayjanov JR, van Hijum SAFT. Explaining microbial phenotypes on a genomic scale: GWAS for microbes. Brief Funct Genomics 2013; 12:366-80. [PMID: 23625995 PMCID: PMC3743258 DOI: 10.1093/bfgp/elt008] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Indexed: 12/24/2022] Open
Abstract
There is an increasing availability of complete or draft genome sequences for microbial organisms. These data form a potentially valuable resource for genotype-phenotype association and gene function prediction, provided that phenotypes are consistently annotated for all the sequenced strains. In this review, we address the requirements for successful gene-trait matching. We outline a basic protocol for microbial functional genomics, including genome assembly, annotation of genotypes (including single nucleotide polymorphisms, orthologous groups and prophages), data pre-processing, genotype-phenotype association, visualization and interpretation of results. The methodologies for association described herein can be applied to other data types, opening up possibilities to analyze transcriptome-phenotype associations, and correlate microbial population structure or activity, as measured by metagenomics, to environmental parameters.
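As an illustration of the genotype-phenotype association step mentioned in this abstract, the following is a minimal sketch of gene-trait matching with a one-sided Fisher exact test on gene presence versus phenotype. It is not the protocol from the review: the function names, the toy strain sets, and the choice of test are all assumptions made for this example.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: strains with the gene and the phenotype
    b: strains with the gene, without the phenotype
    c: strains without the gene, with the phenotype
    d: strains with neither
    Returns P(overlap >= a) under the hypergeometric null.
    """
    gene_total = a + b
    pheno_total = a + c
    n = a + b + c + d
    denom = comb(n, pheno_total)
    p = 0.0
    for k in range(a, min(gene_total, pheno_total) + 1):
        p += comb(gene_total, k) * comb(n - gene_total, pheno_total - k) / denom
    return p


def gene_trait_matching(presence, positives, strains):
    """Return (gene, p-value) pairs sorted by p-value (hypothetical helper).

    presence: dict mapping gene name -> set of strains carrying the gene
    positives: set of strains showing the phenotype
    strains: set of all strains considered
    """
    results = []
    for gene, carriers in presence.items():
        a = len(carriers & positives)
        b = len(carriers - positives)
        c = len(positives - carriers)
        d = len(strains - carriers - positives)
        results.append((gene, fisher_exact_greater(a, b, c, d)))
    return sorted(results, key=lambda item: item[1])


# Toy data: eight hypothetical strains, four of which show the phenotype.
strains = {f"s{i}" for i in range(8)}
positives = {"s0", "s1", "s2", "s3"}
presence = {
    "geneX": {"s0", "s1", "s2", "s3"},  # tracks the phenotype exactly
    "geneY": set(strains),              # core gene, uninformative
}
ranking = gene_trait_matching(presence, positives, strains)
```

In a real analysis the p-values would need multiple-testing correction and some control for population structure, which this sketch omits.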
Affiliation(s)
- Bas E Dutilh
- CMBI, NCMLS, Radboud University Medical Centre. Geert Grooteplein 28, 6525 GA Nijmegen, The Netherlands.
|
72
|
Schlacht A, Mowbrey K, Elias M, Kahn RA, Dacks JB. Ancient complexity, opisthokont plasticity, and discovery of the 11th subfamily of Arf GAP proteins. Traffic 2013; 14:636-49. [PMID: 23433073 DOI: 10.1111/tra.12063] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 06/20/2012] [Revised: 02/20/2013] [Accepted: 02/22/2013] [Indexed: 12/14/2022]
Abstract
The organelle paralogy hypothesis is one model for the acquisition of nonendosymbiotic organelles, generated from molecular evolutionary analyses of proteins encoding specificity in the membrane traffic system. GTPase activating proteins (GAPs) for the ADP-ribosylation factor (Arfs) GTPases are additional regulators of the kinetics and fidelity of membrane traffic. Here we describe molecular evolutionary analyses of the Arf GAP protein family. Of the 10 subfamilies previously defined in humans, we find that 5 were likely present in the last eukaryotic common ancestor. Of the 3 most recently derived subfamilies, 1 was likely present in the ancestor of opisthokonts (animals and fungi) and apusomonads (flagellates classified as the sister lineage to opisthokonts), while 2 arose in the holozoan lineage. We also propose to have identified a novel ancient subfamily (ArfGAPC2), present in diverse eukaryotes but which is lost frequently, including in the opisthokonts. Surprisingly few ancient domains accompanying the ArfGAP domain were identified, in marked contrast to the extensively decorated human Arf GAPs. Phylogenetic analyses of the subfamilies reveal patterns of single and multiple gene duplications specific to the Holozoa, to some degree mirroring evolution of Arf GAP targets, the Arfs. Conservation, and lack thereof, of various residues in the ArfGAP structure provide contextualization of previously identified functional amino acids and their application to Arf GAP biology in general. Overall, our results yield insights into current Arf GAP biology, reveal complexity in the ancient eukaryotic ancestor and integrate the Arf GAP family into a proposed mechanism for the evolution of nonendosymbiotic organelles.
Affiliation(s)
- Alexander Schlacht
- Faculty of Medicine and Dentistry, Department of Cell Biology, University of Alberta, Edmonton, Alberta, Canada
|
73
|
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJE, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YAI, van Dijk ADJ, ter Braak CJF, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nat Methods 2013; 10:221-7. [PMID: 23353650 PMCID: PMC3584181 DOI: 10.1038/nmeth.2340] [Citation(s) in RCA: 604] [Impact Index Per Article: 50.3] [Received: 04/02/2012] [Accepted: 12/10/2012] [Indexed: 01/03/2023]
Abstract
A report on the results of the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Affiliation(s)
- Predrag Radivojac
- School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
|
74
|
Singh S, Malhotra AG, Pandey A, Pandey KM. Computational model for pathway reconstruction to unravel the evolutionary significance of melanin synthesis. Bioinformation 2013; 9:94-100. [PMID: 23390353 PMCID: PMC3563405 DOI: 10.6026/97320630009094] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 01/02/2013] [Accepted: 01/03/2013] [Indexed: 12/15/2022] Open
Abstract
Melanogenesis is a complex multistep process that produces high molecular weight melanins by hydroxylation and polymerization of polyphenols. Melanins have a wide range of applications beyond serving as a sun-protection pigment. The melanogenesis pathway exists in organisms from prokaryotes to eukaryotes. It has evolved over time because the melanin pigment has different roles in diverse taxa. Melanin plays a pivotal role in the survival of certain bacteria and fungi, whereas in higher organisms it serves as protection against harmful radiation. We have carried out a detailed study of the various pathways known for melanin synthesis across species. We found that melanin production is not restricted to tyrosine; other secondary metabolites are also used to synthesize melanin in lower organisms. Furthermore, a phylogenetic study of these pathways was carried out to understand their molecular and cellular development. It revealed that the melanin synthesis pathways have co-evolved in several groups of organisms. In this study, we also introduce a method for the comparative analysis of a metabolic pathway to study its evolution based on similarity between enzymatic reactions.
Affiliation(s)
- Sudha Singh
- Department of Chemical Engineering and Biotechnology, MANIT, Bhopal (M.P.) - 462051
- Ajay Pandey
- Department of Applied Mechanics, MANIT, Bhopal (M.P.) - 462051
- Khushhali M Pandey
- Department of Chemical Engineering and Biotechnology, MANIT, Bhopal (M.P.) - 462051
|
75
|
Muley VY, Ranjan A. Evaluation of physical and functional protein-protein interaction prediction methods for detecting biological pathways. PLoS One 2013; 8:e54325. [PMID: 23349851 PMCID: PMC3547882 DOI: 10.1371/journal.pone.0054325] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 03/07/2012] [Accepted: 12/11/2012] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND: Cellular activities are governed by the physical and functional interactions among proteins involved in various biological pathways. With the availability of sequenced genomes and high-throughput experimental data, one can identify genome-wide protein-protein interactions using various computational techniques. Comparative assessments of these techniques in predicting protein interactions have been frequently reported in the literature, but not their ability to elucidate a particular biological pathway. METHODS: To understand how well interactions among the proteins of specific biological pathways can be predicted, we report the analyses of 14 biological pathways of Escherichia coli catalogued in the KEGG database using five protein-protein functional linkage prediction methods. These methods are phylogenetic profiling, gene neighborhood, co-presence of orthologous genes in the same gene clusters, a mirrortree variant, and expression similarity. CONCLUSIONS: Our results reveal that the prediction of metabolic pathway protein interactions continues to be a challenging task for all methods, which possibly reflects the flexible/independent evolutionary histories of these proteins. These methods have predicted functional associations of proteins involved in amino acid, nucleotide, glycan, and vitamin and cofactor pathways slightly better than the random performance on carbohydrate, lipid and energy metabolism. We also make similar observations for interactions among the environmental information processing proteins. In contrast, protein-protein linkages in genetic information processing, or in specialized processes such as motility that occur in a subset of organisms, are predicted with comparable accuracy. Metabolic pathways are best predicted by using the neighborhood of orthologous genes, whereas the phyletic pattern is good enough to reconstruct central dogma pathway protein interactions. We have also shown that the effective use of a particular prediction method depends on the pathway under investigation. If one is not focused on a specific pathway, the gene expression similarity method is the best option.
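Of the five methods listed in this abstract, phylogenetic profiling is the simplest to illustrate. The sketch below scores gene pairs by the Jaccard similarity of their presence/absence profiles across reference genomes. It is an assumption-laden toy, not the scoring used in the paper: the gene names, the profiles, and the choice of Jaccard similarity are all hypothetical.

```python
from itertools import combinations


def jaccard(p, q):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Score every gene pair and sort from most to least similar."""
    scored = [
        (g1, g2, jaccard(profiles[g1], profiles[g2]))
        for g1, g2 in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, key=lambda item: -item[2])


# Hypothetical profiles: one column per reference genome (1 = ortholog present).
profiles = {
    "geneA": [1, 1, 0, 0, 1],
    "geneB": [1, 1, 0, 0, 1],  # co-occurs with geneA in every genome
    "geneC": [0, 0, 1, 1, 0],  # complementary pattern
}
ranked = rank_pairs(profiles)
```

Pairs at the top of `ranked` would be candidate functional linkages. Other similarity measures (mutual information, Pearson correlation) are common alternatives and can give different rankings.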
Affiliation(s)
- Vijaykumar Yogesh Muley
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, India
- Akash Ranjan
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, India
|
76
|
Pierlé SA, Dark MJ, Dahmen D, Palmer GH, Brayton KA. Comparative genomics and transcriptomics of trait-gene association. BMC Genomics 2012. [PMID: 23181781 PMCID: PMC3542260 DOI: 10.1186/1471-2164-13-669] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 12/02/2022] Open
Abstract
Background The Order Rickettsiales includes important tick-borne pathogens, from Rickettsia rickettsii, which causes Rocky Mountain spotted fever, to Anaplasma marginale, the most prevalent vector-borne pathogen of cattle. Although most pathogens in this Order are transmitted by arthropod vectors, little is known about the microbial determinants of transmission. A. marginale provides unique tools for studying the determinants of transmission, with multiple strain sequences available that display distinct and reproducible transmission phenotypes. The closed core A. marginale genome suggests that any phenotypic differences are due to single nucleotide polymorphisms (SNPs). We combined DNA/RNA comparative genomic approaches using strains with different tick transmission phenotypes and identified genes that segregate with transmissibility. Results Comparison of seven strains with different transmission phenotypes generated a list of SNPs affecting 18 genes and nine promoters. Transcriptional analysis found two candidate genes downstream from promoter SNPs that were differentially transcribed. To corroborate the comparative genomics approach we used three RNA-seq platforms to analyze the transcriptomes from two A. marginale strains with different transmission phenotypes. RNA-seq analysis confirmed the comparative genomics data and found 10 additional genes whose transcription between strains with distinct transmission efficiencies was significantly different. Six regions of the genome that contained no annotation were found to be transcriptionally active, and two of these newly identified transcripts were differentially transcribed. Conclusions This approach identified 30 genes and two novel transcripts potentially involved in tick transmission. We describe the transcriptome of an obligate intracellular bacterium in depth, while employing massive parallel sequencing to dissect an important trait in bacterial pathogenesis.
Affiliation(s)
- Sebastián Aguilar Pierlé
- Program in Genomics, Department of Veterinary Microbiology and Pathology, Paul G. Allen School for Global Animal Health, Washington State University, Pullman, WA 99164-7040, USA.
|
77
|
Abstract
The introduction of the term ‘Tubulin Polymerization Promoting Protein (TPPP)-like proteins’ is suggested. They constitute a eukaryotic protein superfamily, characterized by the presence of the p25alpha domain (Pfam05517, IPR008907), and named after the first identified member, TPPP/p25, which exhibits microtubule-stabilizing function. TPPP-like proteins can be grouped on the basis of two characteristics: the length of their p25alpha domain, which can be long, short, truncated or partial, and the presence or absence of additional domain(s). TPPPs, in the strict sense, contain no domain other than a single long or short p25alpha domain (long- and short-type TPPPs, respectively). Proteins possessing a truncated p25alpha domain are described for the first time in this paper. They evolved from the long-type TPPPs and can be considered arthropod-specific paralogs of long-type TPPPs. Phylogenetic analysis shows that the two groups (long-type and truncated TPPPs) split in the common ancestor of arthropods. Incomplete p25alpha domains can be found in multidomain TPPP-like proteins as well. The various subfamilies occur with a characteristic phyletic distribution: e.g., animal genomes/proteomes contain long-type TPPPs almost without exception; the multidomain apicortins occur almost exclusively in apicomplexan parasites. There are no data about the physiological function of these proteins except for two human long-type TPPP paralogs, which are involved in developmental processes of the brain and the musculoskeletal system, respectively. I predict that the superfamily members containing a long or partial p25alpha domain are often intrinsically disordered proteins, while those with short or truncated domain(s) are structurally ordered. Interestingly, members of this superfamily that are connected, or possibly connected, to diseases are intrinsically disordered proteins.
Affiliation(s)
- Ferenc Orosz
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary.
|
78
|
A fish-specific member of the TPPP protein family? J Mol Evol 2012; 75:55-72. [PMID: 23053195 DOI: 10.1007/s00239-012-9521-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/21/2011] [Accepted: 09/24/2012] [Indexed: 02/02/2023]
Abstract
A eukaryotic protein family, the tubulin polymerization promoting proteins (TPPPs), has recently been identified. It has been termed after its first member, TPPP/p25 or TPPP1, which exhibits microtubule-stabilizing function and plays a role in neurodegenerative diseases. In mammalian genomes, two further paralogues, TPPP2 and TPPP3, can be found. In this article, I show that TPPP1 and TPPP3, but not TPPP2, are included in paralogons, on human chromosomes, Hsa5 and Hsa16, respectively. I suggest that the single non-vertebrate tppp gene was duplicated in the first round of whole-genome duplication in the vertebrate lineage giving rise to tppp1 and the precursor of tppp2/tppp3. The existence of a teleost fish-specific fourth paralogue, tppp4, has also been raised, but it is not supported by synteny analysis. Alternatively, the new group can be considered as the fish orthologue of TPPP2. The case that the new group is the consequence of the teleost fish-specific whole-genome duplication (3R) cannot be excluded.
|
79
|
Eisenhaber F. A decade after the first full human genome sequencing: when will we understand our own genome? J Bioinform Comput Biol 2012; 10:1271001. [DOI: 10.1142/s0219720012710011] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 01/03/2023]
Abstract
The contrast between the pomp of celebrating the first full human genome sequencing in 2000 and the cautious tone of recollections a decade thereafter could hardly be greater. The promises with regard to medical cures and biotechnology applications have fallen far short of expectations. Understanding the human genome means knowing the functions of genes and proteins and their interconnectedness via biomolecular mechanisms. This article estimates how long it will take to achieve this goal if we extrapolate from the previous decade (indeed, a century!) and considers the possible disruptive trends in science, technology and society that may accelerate the pace of progress dramatically.
Collapse
Affiliation(s)
- FRANK EISENHABER
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
|
80
|
Muley VY, Ranjan A. Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction. PLoS One 2012; 7:e42057. [PMID: 22844541 PMCID: PMC3406042 DOI: 10.1371/journal.pone.0042057] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2011] [Accepted: 07/02/2012] [Indexed: 12/20/2022] Open
Abstract
Background Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to have a shared evolutionary history. This history can be traced out for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs in multiple genomes known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood and co-occurrence of the orthologous protein coding genes in the same cluster or operon. These are collectively known as genomic context methods. On the other hand a method called mirrortree is based on the similarity of phylogenetic trees between two interacting proteins. Comprehensive performance analyses of these methods have been frequently reported in literature. However, very few studies provide insight into the effect of reference genome selection on detection of meaningful protein interactions. Methods We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled in accordance with phylogenetic diversity and relationship between organisms from 565 bacteria. We used Escherichia coli as a model organism and the gold standard datasets of interacting proteins reported in DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods. Conclusions Higher performance for predicting protein-protein interactions was achievable even with 100–150 bacterial genomes out of 565 genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that in order to obtain a good performance, it is better to sample few genomes of related genera of prokaryotes from the large number of available genomes. 
Moreover, such a sampling allows for selecting 50–100 genomes for comparable accuracy of predictions when computational resources are limited.
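Phylogenetic profiling, one of the genomic context methods named in this abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: the profiles, protein names, and the choice of Pearson correlation as the similarity measure are assumptions made for illustration. Each profile records presence (1) or absence (0) of a homolog in each reference genome, and the optional `genomes` argument mimics selecting a subset of reference genomes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(vx * vy)

# Hypothetical presence/absence profiles over six reference genomes.
profiles = {
    "protA": [1, 1, 0, 1, 0, 0],
    "protB": [1, 1, 0, 1, 0, 0],
    "protC": [0, 0, 1, 0, 1, 1],
}

def profile_similarity(p, q, genomes=None):
    """Correlate two profiles, optionally restricted to a subset of genome indices."""
    idx = range(len(profiles[p])) if genomes is None else genomes
    return pearson([profiles[p][i] for i in idx], [profiles[q][i] for i in idx])

full = profile_similarity("protA", "protB")
subset = profile_similarity("protA", "protB", genomes=[0, 1, 3])
```

In this toy case the two co-occurring proteins correlate perfectly over all six genomes, but the signal vanishes when only genomes containing both proteins are kept, which is one way to see why the choice of reference genomes can change prediction performance.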
Affiliation(s)
- Vijaykumar Yogesh Muley
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, Andhra Pradesh, India
- Department of Biotechnology, Dr. Babasaheb Ambedkar Marathwada University, Sub-centre, Osmanabad, Maharashtra, India
- Akash Ranjan
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, Andhra Pradesh, India
|
81
|
Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comput Biol 2012; 8:e1002514. [PMID: 22615551 PMCID: PMC3355068 DOI: 10.1371/journal.pcbi.1002514] [Citation(s) in RCA: 144] [Impact Index Per Article: 11.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2011] [Accepted: 03/26/2012] [Indexed: 02/07/2023] Open
Abstract
The function of most proteins is not determined experimentally, but is extrapolated from homologs. According to the “ortholog conjecture”, or standard model of phylogenomics, protein function changes rapidly after duplication, leading to paralogs with different functions, while orthologs retain the ancestral function. We report here that a comparison of experimentally supported functional annotations among homologs from 13 genomes mostly supports this model. We show that to analyze GO annotation effectively, several confounding factors need to be controlled: authorship bias, variation of GO term frequency among species, variation of background similarity among species pairs, and propagated annotation bias. After controlling for these biases, we observe that orthologs have generally more similar functional annotations than paralogs. This is especially strong for sub-cellular localization. We observe only a weak decrease in functional similarity with increasing sequence divergence. These findings hold over a large diversity of species; notably orthologs from model organisms such as E. coli, yeast or mouse have conserved function with human proteins. To infer the function of an unknown gene, possibly the most effective way is to identify a well-characterized evolutionarily related gene, and assume that they have both kept their ancestral function. If several such homologs are available, all else being equal, it has long been assumed that those that diverged by speciation (“ortholog”) are functionally closer than those that diverged by duplication (“paralogs”); thus function is more reliably inferred from the former. But despite its prevalence, this model mostly rests on first principles, as for the longest time we have not had sufficient data to test it empirically. Recently, some studies began investigating this question and have cast doubt on the validity of this model. 
Here, we show that by considering a wide range of organisms and data, and, crucially, by correcting for several easily overlooked biases affecting functional annotations, the standard model is corroborated by the presently available experimental data.
Affiliation(s)
- Adrian M. Altenhoff
- ETH Zurich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Romain A. Studer
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, London, United Kingdom
- Marc Robinson-Rechavi
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Christophe Dessimoz
- ETH Zurich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- EMBL-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
|
82
|
Guy L, Nystedt B, Sun Y, Näslund K, Berglund EC, Andersson SGE. A genome-wide study of recombination rate variation in Bartonella henselae. BMC Evol Biol 2012; 12:65. [PMID: 22577862 PMCID: PMC3483213 DOI: 10.1186/1471-2148-12-65] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2012] [Accepted: 04/17/2012] [Indexed: 11/16/2022] Open
Abstract
Background Rates of recombination vary by three orders of magnitude in bacteria but the reasons for this variation are unclear. We performed a genome-wide study of recombination rate variation among genes in the intracellular bacterium Bartonella henselae, which has among the lowest estimated ratio of recombination relative to mutation in prokaryotes. Results The 1.9 Mb genomes of B. henselae strains IC11, UGA10 and Houston-1 showed only minor gene content variation. Nucleotide sequence divergence levels were less than 1% and the relative rate of recombination to mutation was estimated at 1.1 for the genome overall. Four to eight segments per genome presented significantly enhanced divergences, the most pronounced of which were the virB and trw gene clusters for type IV secretion systems that play essential roles in the infection process. Consistently, multiple recombination events were identified inside these gene clusters. High recombination frequencies were also observed for a gene putatively involved in iron metabolism. A phylogenetic study of this gene in 80 strains of Bartonella quintana, B. henselae and B. grahamii indicated different population structures for each species and revealed horizontal gene transfers across Bartonella species with different host preferences. Conclusions Our analysis has shown little novel gene acquisition in B. henselae, indicative of a closed pan-genome, but higher recombination frequencies within the population than previously estimated. We propose that the dramatically increased fixation rate for recombination events at gene clusters for type IV secretion systems is driven by selection for sequence variability.
Affiliation(s)
- Lionel Guy
- Department of Molecular Evolution, Biomedical Centre, Uppsala University, SE-751 24, Uppsala, Sweden
|
83
|
James K, Wipat A, Hallinan J. Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae. Integr Biol (Camb) 2012; 4:715-27. [PMID: 22526920 DOI: 10.1039/c2ib00123c] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system is dependent upon the quality of its component data, and upon the Gold Standard data used to evaluate it. It is commonly assumed that the quality of data improves as databases grow and change, particularly for manually curated databases. However, the validity of this assumption can be questioned, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly-curated and widely-used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that the global trend in network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes and by selecting and integrating relevant datasets performance can be improved. When using any type of integrated system to answer a specific biological question careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
Affiliation(s)
- Katherine James
- School of Computing Science, Newcastle University, Newcastle upon Tyne, NE1 7RU, United Kingdom
|
84
|
Phylogeny of the staphylococcal major autolysin and its use in genus and species typing. J Bacteriol 2012; 194:2630-6. [PMID: 22427631 DOI: 10.1128/jb.06609-11] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
The major staphylococcal autolysin Atl is an important player in cell separation and daughter cell formation. In this study, we investigated the amino acid sequences of Atl proteins derived from 15 staphylococcal and 1 macrococcal species representatives. The overall organization of the bifunctional precursor protein consisting of the signal peptide, a propeptide (PP), the amidase (AM), six repeat sequences (R(1) to R(6)), and the glucosaminidase (GL) was highly conserved in all of the species. The most-conserved domains were the enzyme domains AM and GL; the least-conserved regions were the PP and R regions. An Atl-based phylogenetic tree for the various species representatives correlated well with the corresponding 16S rRNA-based tree and also perfectly matched the phylogenetic trees based on core genome analysis. The phylogenetic distance analysis of 18 AtlA proteins of various Staphylococcus aureus strains and 15 AtlE proteins of S. epidermidis revealed that both species representatives formed a relatively homogeneous cluster. Two S. epidermidis strains, M23864:W1 and VCU116, were identified by Atl typing that clustered far more distantly and belonged to either S. caprae and S. capitis or a new subspecies. Here we show that Atl typing is a useful tool for staphylococcal genus and species typing by using either the highly conserved AM domain or the less-conserved PP domain.
|
85
|
Merhej V, Notredame C, Royer-Carenzi M, Pontarotti P, Raoult D. The rhizome of life: the sympatric Rickettsia felis paradigm demonstrates the random transfer of DNA sequences. Mol Biol Evol 2012; 28:3213-23. [PMID: 22024628 DOI: 10.1093/molbev/msr239] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
The intracellular flea symbiont, Rickettsia felis, may meet other organisms intracellularly such as R. typhi. We used a single-gene phylogenetic approach of the 1375 R. felis genes to look for horizontal transfers that occurred as a result of the bacterial promiscuity with other organisms. Our results showed that besides genes that are linked to the Spotted Fever Group, 165 genes have a different history and are linked to other Rickettsia such as R. bellii (107 genes), R. typhi (15 genes), or to other bacteria such as Legionella sp. and Francisella sp. or to eukaryotes. Among these genes, we identified 73 individual genes and 34 spatial clusters containing 2-4 adjacent genes, a total of 79 genes, with evidence of en bloc transfer. We described 13 chimeric genes resulting from gene recombination with sympatric R. typhi. The transferred DNA sequences present different sizes and functions, suggesting that the horizontal transfer in R. felis is random and neutral within its specific host. Our study shows that the strictly intracellular bacterium R. felis exhibits a mosaic genome. We therefore developed a new representation for the evolutionary history of R. felis showing its different putative ancestors in the form of a rhizome.
Affiliation(s)
- Vicky Merhej
- Unité de Recherche en Maladies Infectieuses et Tropicales Emergentes, CNRS-IRD UMR6236-198, Université de la Méditerranée, Faculté de Médecine, Marseille, France
|
86
|
CHEN YZ, LI ZR, UNG CY. COMPUTATIONAL METHOD FOR DRUG TARGET SEARCH AND APPLICATION IN DRUG DISCOVERY. JOURNAL OF THEORETICAL & COMPUTATIONAL CHEMISTRY 2012. [DOI: 10.1142/s0219633602000166] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Ligand-protein inverse docking has recently been introduced as a computer method for identification of potential protein targets of a drug. A protein structure database is searched to find proteins to which a drug can bind or weakly bind. Examples of potential applications of this method in facilitating drug discovery include: (1) identification of unknown and secondary therapeutic targets of a drug, (2) prediction of potential toxicity and side effect of an investigative drug, and (3) probing molecular mechanism of bioactive herbal compounds such as those extracted from plants used in traditional medicines. This method and recent results on its applications in solving various drug discovery problems are reviewed.
Affiliation(s)
- Y. Z. CHEN
- Department of Computational Science, National University of Singapore, Blk Soc1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Singapore-MIT alliance, National University of Singapore, E4-04-10, 4 Engineering Drive 3, Singapore 117576, Singapore
- Z. R. LI
- Department of Computational Science, National University of Singapore, Blk Soc1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Singapore-MIT alliance, National University of Singapore, E4-04-10, 4 Engineering Drive 3, Singapore 117576, Singapore
- C. Y. UNG
- Department of Computational Science, National University of Singapore, Blk Soc1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
|
87
|
Environmental Comparative Pharmacology: Theory and Application. EMERGING TOPICS IN ECOTOXICOLOGY 2012. [DOI: 10.1007/978-1-4614-3473-3_5] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
88
|
Hallinan JS, James K, Wipat A. Network approaches to the functional analysis of microbial proteins. Adv Microb Physiol 2011; 59:101-33. [PMID: 22114841 DOI: 10.1016/b978-0-12-387661-4.00005-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Large amounts of detailed biological data have been generated over the past few decades. Many of these data are freely available in over 1000 online databases; an enticing, but frustrating resource for microbiologists interested in a systems-level view of the structure and function of microbial cells. The frustration engendered by the need to trawl manually through hundreds of databases in order to accumulate information about a gene, protein, pathway, or organism of interest can be alleviated by the use of computational data integration to generate network views of the system of interest. Biological networks can be constructed from a single type of data, such as protein-protein binding information, or from data generated by multiple experimental approaches. In an integrated network, nodes usually represent genes or gene products, while edges represent some form of interaction between the nodes. Edges between nodes may be weighted to represent the probability that the edge exists in vivo. Networks may also be enriched with ontological annotations, facilitating both visual browsing and computational analysis via web service interfaces. In this review, we describe the construction and analysis of both single-data source and integrated networks, and their application to the inference of protein function in microbes.
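The idea of probability-weighted edges in an integrated network can be sketched as follows. This is an illustration only: the abstract does not say how evidence from several data sources is combined, so the noisy-OR rule used here (which treats sources as independent) is an assumption, and the gene names and per-source probabilities are hypothetical.

```python
def combine_noisy_or(probabilities):
    """Probability that an edge exists given independent per-source probabilities."""
    p_absent = 1.0
    for p in probabilities:
        p_absent *= (1.0 - p)  # edge is absent only if every source is wrong
    return 1.0 - p_absent

# Hypothetical evidence: edge -> list of per-dataset probabilities.
evidence = {
    ("geneA", "geneB"): [0.5, 0.5],  # supported by two datasets
    ("geneB", "geneC"): [0.2],       # supported by one dataset
}

# Integrated network: one weighted edge per gene pair.
network = {edge: combine_noisy_or(ps) for edge, ps in evidence.items()}
```

With this rule, an edge seen in two moderately reliable datasets ends up with a higher weight than either dataset gives on its own, which is the usual motivation for integrating sources.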
Affiliation(s)
- J S Hallinan
- School of Computing Science, Newcastle University, Newcastle, UK
|
89
|
WONG WINGCHEONG, MAURER-STROH SEBASTIAN, EISENHABER FRANK. THE JANUS-FACED E-VALUES OF HMMER2: EXTREME VALUE DISTRIBUTION OR LOGISTIC FUNCTION? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
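The "lower of the two" rule described in this abstract can be sketched numerically. The sketch below is not HMMER2 code: it assumes the logistic bound takes the form P ≤ 1/(1 + 2^score) for a bit score and that the EVD tail is the Gumbel form 1 − exp(−exp(−λ(score − μ))); the values of `mu`, `lam`, and `n_targets` are hypothetical and chosen only so that a critical score region appears.

```python
import math

def logistic_pvalue(score):
    """Assumed logistic bound on the P-value of a bit score."""
    if score >= 1000:  # avoid float overflow in 2.0 ** score
        return 0.0
    return 1.0 / (1.0 + 2.0 ** score)

def evd_pvalue(score, mu, lam):
    """Gumbel-type EVD tail: P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    exponent = min(-lam * (score - mu), 700.0)  # clamp to keep math.exp finite
    return -math.expm1(-math.exp(exponent))

def reported_evalue(score, mu, lam, n_targets):
    """Return (reported, logistic-based, EVD-based) E-values; reported is the lower one."""
    e_logistic = n_targets * logistic_pvalue(score)
    e_evd = n_targets * evd_pvalue(score, mu, lam)
    return min(e_logistic, e_evd), e_logistic, e_evd

# Hypothetical model parameters and search size.
reported, e_logistic, e_evd = reported_evalue(score=15.0, mu=-2.0, lam=0.5, n_targets=1000)
```

With these illustrative numbers the logistic-based E-value falls below 0.1 while the EVD-based E-value stays above it, so a score of 15 lies in the kind of critical region the abstract describes: an automated pipeline with a 0.1 cutoff would accept the hit on the logistic value alone.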
Affiliation(s)
- WING-CHEONG WONG
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- SEBASTIAN MAURER-STROH
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
- FRANK EISENHABER
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
|
90
|
Wong WC, Maurer-Stroh S, Eisenhaber F. Not all transmembrane helices are born equal: Towards the extension of the sequence homology concept to membrane proteins. Biol Direct 2011; 6:57. [PMID: 22024092 PMCID: PMC3217874 DOI: 10.1186/1745-6150-6-57] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2011] [Accepted: 10/25/2011] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Sequence homology considerations widely used to transfer functional annotation to uncharacterized protein sequences require special precautions in the case of non-globular sequence segments including membrane-spanning stretches composed of non-polar residues. Simple, quantitative criteria are desirable for identifying transmembrane helices (TMs) that must be included into or should be excluded from start sequence segments in similarity searches aimed at finding distant homologues. RESULTS We found that there are two types of TMs in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as a membrane-anchoring device. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring such as intra-membrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur both in single- and multi-membrane-spanning proteins essentially in any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information. CONCLUSION For extending the homology concept onto membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs (and a sufficient criterion for complex TMs) in query sequences prior to their usage in homology searches based on assessment of hydrophobicity and sequence complexity of the TM sequence segments.
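The two quantities this abstract relies on, hydrophobicity and sequence complexity of a TM segment, can be computed in a few lines. The abstract does not state which scales or thresholds the authors use, so this sketch substitutes the Kyte-Doolittle hydropathy scale and Shannon entropy of the residue composition as stand-ins; both example segments are made up.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(segment):
    """Average Kyte-Doolittle hydropathy over the segment."""
    return sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment)

def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition; low values mean low complexity."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical 20-residue segments.
simple_like = "LLLLALLLLLVLLLLLLILL"    # leucine-rich, anchor-like
complex_like = "GSTWYLNPFGAVHCMTQSLI"   # mixed composition

simple_scores = (mean_hydropathy(simple_like), shannon_entropy(simple_like))
complex_scores = (mean_hydropathy(complex_like), shannon_entropy(complex_like))
```

The anchor-like segment scores high on hydropathy and low on entropy, the mixed segment the reverse, which matches the qualitative contrast between simple and complex TMs drawn in the abstract. Any actual cutoff would have to come from the paper itself.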
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute, Agency for Science, Technology and Research, Matrix, Singapore
|
91
|
Cross-Genome Comparisons of Newly Identified Domains in Mycoplasma gallisepticum and Domain Architectures with Other Mycoplasma species. Comp Funct Genomics 2011; 2011:878973. [PMID: 21860605 PMCID: PMC3155973 DOI: 10.1155/2011/878973] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2010] [Revised: 02/21/2011] [Accepted: 05/23/2011] [Indexed: 11/30/2022] Open
Abstract
Accurate functional annotation of protein sequences is hampered by important factors such as the failure of sequence search methods to identify relationships and the inherent diversity in function of proteins related at low sequence similarities. Earlier, we employed an intermediate sequence search approach to establish new domain relationships in the unassigned regions of gene products at the whole-genome level, taking Mycoplasma gallisepticum as a specific example. In this paper, we report a detailed comparison of the conservation status of the domains and domain architectures of the gene products that bear our newly predicted domains amongst 14 other Mycoplasma genomes and report the probable implications for the organisms. Some of the domain associations, observed in Mycoplasma that afflict humans and non-human primates, are involved in regulation of solute transport and DNA binding, suggesting specific modes of host-pathogen interactions.
|
92
|
Zinman GE, Zhong S, Bar-Joseph Z. Biological interaction networks are conserved at the module level. BMC SYSTEMS BIOLOGY 2011; 5:134. [PMID: 21861884 PMCID: PMC3212960 DOI: 10.1186/1752-0509-5-134] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/16/2011] [Accepted: 08/23/2011] [Indexed: 12/02/2022]
Abstract
Background Orthologous genes are highly conserved between closely related species and biological systems often utilize the same genes across different organisms. However, while sequence similarity often implies functional similarity, interaction data is not well conserved even for proteins with high sequence similarity. Several recent studies comparing high throughput data including expression, protein-protein, protein-DNA, and genetic interactions between close species show conservation at a much lower rate than expected. Results In this work we collected comprehensive high-throughput interaction datasets for four model organisms (S. cerevisiae, S. pombe, C. elegans, and D. melanogaster) and carried out systematic analyses in order to explain the apparent lower conservation of interaction data when compared to the conservation of sequence data. We first showed that several previously proposed hypotheses only provide a limited explanation for such lower conservation rates. We combined all interaction evidences into an integrated network for each species and identified functional modules from these integrated networks. We then demonstrate that interactions that are part of functional modules are conserved at much higher rates than previous reports in the literature, while interactions that connect between distinct functional modules are conserved at lower rates. Conclusions We show that conservation is maintained between species, but mainly at the module level. Our results indicate that interactions within modules are much more likely to be conserved than interactions between proteins in different modules. This provides a network based explanation to the observed conservation rates that can also help explain why so many biological processes are well conserved despite the lower levels of conservation for the interactions of proteins participating in these processes. Accompanying website: http://www.sb.cs.cmu.edu/CrossSP
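The comparison at the heart of this abstract, conservation of interactions inside modules versus between modules, reduces to two ratios. The sketch below uses invented proteins, module assignments, and a made-up set of conserved interactions; it does not reproduce the authors' module detection or their definition of conservation.

```python
def conservation_rates(interactions, module_of, conserved):
    """Return (within-module rate, between-module rate) of conserved interactions."""
    within = [e for e in interactions if module_of[e[0]] == module_of[e[1]]]
    between = [e for e in interactions if module_of[e[0]] != module_of[e[1]]]

    def rate(edges):
        # Fraction of edges that are also observed in the other species.
        return sum(e in conserved for e in edges) / len(edges) if edges else 0.0

    return rate(within), rate(between)

# Hypothetical data for one species pair.
interactions = [("a", "b"), ("c", "d"), ("b", "c"), ("a", "d"), ("a", "c")]
module_of = {"a": 1, "b": 1, "c": 2, "d": 2}
conserved = {("a", "b"), ("c", "d"), ("a", "c")}

within_rate, between_rate = conservation_rates(interactions, module_of, conserved)
```

In this toy network both within-module interactions are conserved but only one of the three between-module interactions is, the same direction of effect the abstract reports.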
Affiliation(s)
- Guy E Zinman
- Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
93
|
Gallone G, Simpson TI, Armstrong JD, Jarman AP. Bio::Homology::InterologWalk--a Perl module to build putative protein-protein interaction networks through interolog mapping. BMC Bioinformatics 2011; 12:289. [PMID: 21767381 PMCID: PMC3161927 DOI: 10.1186/1471-2105-12-289] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2011] [Accepted: 07/18/2011] [Indexed: 02/25/2023] Open
Abstract
BACKGROUND Protein-protein interaction (PPI) data are widely used to generate network models that aim to describe the relationships between proteins in biological systems. The fidelity and completeness of such networks is primarily limited by the paucity of protein interaction information and by the restriction of most of these data to just a few widely studied experimental organisms. In order to extend the utility of existing PPIs, computational methods can be used that exploit functional conservation between orthologous proteins across taxa to predict putative PPIs or 'interologs'. To date most interolog prediction efforts have been restricted to specific biological domains with fixed underlying data sources and there are no software tools available that provide a generalised framework for 'on-the-fly' interolog prediction. RESULTS We introduce Bio::Homology::InterologWalk, a Perl module to retrieve, prioritise and visualise putative protein-protein interactions through an orthology-walk method. The module uses orthology and experimental interaction data to generate putative PPIs and optionally collates meta-data into an Interaction Prioritisation Index that can be used to help prioritise interologs for further analysis. We show the application of our interolog prediction method to the genomic interactome of the fruit fly, Drosophila melanogaster. We analyse the resulting interaction networks and show that the method proposes new interactome members and interactions that are candidates for future experimental investigation. CONCLUSIONS Our interolog prediction tool employs the Ensembl Perl API and PSICQUIC enabled protein interaction data sources to generate up to date interologs 'on-the-fly'. This represents a significant advance on previous methods for interolog prediction as it allows the use of the latest orthology and protein interaction data for all of the genomes in Ensembl. 
The module outputs simple text files, making it easy to customise the results by post-processing, allowing the putative PPI datasets to be easily integrated into existing analysis workflows. The Bio::Homology::InterologWalk module, sample scripts and full documentation are freely available from the Comprehensive Perl Archive Network (CPAN) under the GNU Public license.
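Interolog mapping as outlined in this abstract can be illustrated with a toy example. The module itself is written in Perl and queries Ensembl and PSICQUIC services; the Python sketch below makes no such calls and does not mirror the module's API. The function name, protein identifiers, and data are hypothetical. It projects each known interaction in a source species onto the target species through an orthology table.

```python
def predict_interologs(orthologs, source_interactions):
    """Project source-species interactions onto a target species via orthology.

    orthologs: dict mapping target protein -> set of source-species orthologs
    source_interactions: iterable of (source protein, source protein) pairs
    """
    # Invert the orthology table: source protein -> set of target proteins.
    to_target = {}
    for target, sources in orthologs.items():
        for source in sources:
            to_target.setdefault(source, set()).add(target)

    predicted = set()
    for a, b in source_interactions:
        for ta in to_target.get(a, ()):
            for tb in to_target.get(b, ()):
                # Sort so each undirected pair is stored once.
                predicted.add(tuple(sorted((ta, tb))))
    return predicted

# Hypothetical orthology table (fly -> yeast) and yeast interactions.
orthologs = {"fly1": {"yeastA"}, "fly2": {"yeastB"}, "fly3": {"yeastC"}}
yeast_interactions = [("yeastA", "yeastB"), ("yeastB", "yeastD")]

putative = predict_interologs(orthologs, yeast_interactions)
```

Only the yeastA-yeastB interaction has orthologs on both sides, so a single putative fly interaction results; the second yeast interaction is dropped because yeastD has no fly ortholog in the table.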
Affiliation(s)
- Giuseppe Gallone
- Centre for Integrative Physiology, University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XD, UK.
|
94
|
José-Edwards DS, Kerner P, Kugler JE, Deng W, Jiang D, Di Gregorio A. The identification of transcription factors expressed in the notochord of Ciona intestinalis adds new potential players to the brachyury gene regulatory network. Dev Dyn 2011; 240:1793-805. [PMID: 21594950 PMCID: PMC3685856 DOI: 10.1002/dvdy.22656] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Accepted: 04/20/2011] [Indexed: 11/07/2022] Open
Abstract
The notochord is the distinctive characteristic of chordates; however, the knowledge of the complement of transcription factors governing the development of this structure is still incomplete. Here we present the expression patterns of seven transcription factor genes detected in the notochord of the ascidian Ciona intestinalis at various stages of embryonic development. Four of these transcription factors, Fos-a, NFAT5, AFF and Klf15, have not been directly associated with the notochord in previous studies, while the others, including Spalt-like-a, Lmx-like, and STAT5/6-b, display evolutionarily conserved expression in this structure as well as in other domains. We examined the hierarchical relationships between these genes and the transcription factor Brachyury, which is necessary for notochord development in all chordates. We found that Ciona Brachyury regulates the expression of most, although not all, of these genes. These results shed light on the genetic regulatory program underlying notochord formation in Ciona and possibly other chordates.
Affiliation(s)
- Diana S. José-Edwards
- Department of Cell and Developmental Biology, Weill Medical College of Cornell University, 1300 York Avenue, Box 60, New York, NY 10065, U.S.A.
- Pierre Kerner
- Department of Cell and Developmental Biology, Weill Medical College of Cornell University, 1300 York Avenue, Box 60, New York, NY 10065, U.S.A.
- Jamie E. Kugler
- Department of Cell and Developmental Biology, Weill Medical College of Cornell University, 1300 York Avenue, Box 60, New York, NY 10065, U.S.A.
- Wei Deng
- Sars International Centre for Marine Molecular Biology, Thormøhlensgt. 55, N-5008 Bergen, Norway
- Di Jiang
- Sars International Centre for Marine Molecular Biology, Thormøhlensgt. 55, N-5008 Bergen, Norway
- Anna Di Gregorio
- Department of Cell and Developmental Biology, Weill Medical College of Cornell University, 1300 York Avenue, Box 60, New York, NY 10065, U.S.A.
|
95
|
Orosz F. Apicomplexan apicortins possess a long disordered N-terminal extension. Infect Genet Evol 2011; 11:1037-44. [DOI: 10.1016/j.meegid.2011.03.023] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 11/03/2010] [Revised: 03/24/2011] [Accepted: 03/25/2011] [Indexed: 01/01/2023]
|
96
|
Cloning and characterization of boron transporters in Brassica napus. Mol Biol Rep 2011; 39:1963-73. [PMID: 21660474 DOI: 10.1007/s11033-011-0930-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Received: 08/16/2010] [Accepted: 05/24/2011] [Indexed: 10/18/2022]
Abstract
Six full-length cDNAs encoding boron transporters (BOR) were isolated from Brassica napus (AACC) by rapid amplification of cDNA ends (RACE). Phylogenetic analysis revealed that the six BORs were orthologues of AtBOR1 that arose along with the triplication and allotetraploidization process of B. napus, and that they fell into three groups. Each group comprised two members, one originating from Brassica rapa (AA) and the other from Brassica oleracea (CC). Based on the phylogenetic relationships, the six genes were named BnBOR1;1a, BnBOR1;1c, BnBOR1;2a, BnBOR1;2c, BnBOR1;3a and BnBOR1;3c. The deduced BnBOR1s had extensive similarity with other plant BORs, with amino acid sequence identity of 74-96.8%. BnBOR1;3a and BnBOR1;3c resembled AtBOR1 in the number and positions of their 11 introns, whereas the others have only 9 introns. After the gene duplication, there was evidence of purifying selection under a divergent selective pressure. The expression patterns of the six BnBOR1s were detected by semi-quantitative RT-PCR. BnBOR1;3a and BnBOR1;3c showed ubiquitous expression in all of the investigated tissues, whereas the other four genes showed a similar tissue-specific expression profile. Unlike the non-transcriptional regulation of AtBOR1, the expression of BnBOR1;1c and BnBOR1;2a was clearly induced by boron deficiency. This study suggests that the BOR1s have diverged in expression pattern in the genome of B. napus since B. napus diverged from Arabidopsis thaliana.
|
97
|
Lechner M, Findeiss S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics 2011; 12:124. [PMID: 21526987 PMCID: PMC3114741 DOI: 10.1186/1471-2105-12-124] [Citation(s) in RCA: 825] [Impact Index Per Article: 58.9] [Received: 12/13/2010] [Accepted: 04/28/2011] [Indexed: 02/07/2023] Open
Abstract
Background Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practice. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. Results The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Conclusions Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.
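For readers unfamiliar with the reciprocal best alignment heuristic mentioned in this abstract, the following Python sketch shows its plain two-genome form. It is not Proteinortho's extended version, whose details are not given here. The function names and the alignment scores are made up for illustration, and the sketch assumes pairwise scores have already been computed by some alignment tool.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict of (query, target) -> alignment score (higher is better).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores with hypothetical protein identifiers.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 180, ("a2", "b2"): 40}
b_vs_a = {("b1", "a1"): 210, ("b1", "a2"): 170, ("b2", "a2"): 45}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In the toy data, a2's best hit is b1, but b1's best hit is a1, so only (a1, b1) is reported. This strictness is what makes the plain heuristic conservative about paralogs.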
Affiliation(s)
- Marcus Lechner
- RNA Bioinformatics Group, Department of Pharmaceutical Chemistry, Philipps-University Marburg, Germany.
|
98
|
Salichos L, Rokas A. Evaluating ortholog prediction algorithms in a yeast model clade. PLoS One 2011; 6:e18755. [PMID: 21533202 PMCID: PMC3076445 DOI: 10.1371/journal.pone.0018755] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Received: 11/10/2010] [Accepted: 03/15/2011] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Accurate identification of orthologs is crucial for evolutionary studies and for functional annotation. Several algorithms have been developed for ortholog delineation, but so far, manually curated genome-scale biological databases of orthologous genes for algorithm evaluation have been lacking. We evaluated four popular ortholog prediction algorithms (MultiParanoid; OrthoMCL; RBH: Reciprocal Best Hit; and RSD: Reciprocal Smallest Distance; the last two extended into clustering algorithms cRBH and cRSD, respectively, so that they can predict orthologs across multiple taxa) against a set of 2,723 groups of high-quality curated orthologs from 6 Saccharomycete yeasts in the Yeast Gene Order Browser. RESULTS Examination of sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], and accuracy [(TP+TN)/(TP+TN+FP+FN)] across a broad parameter range showed that cRBH was the most accurate and specific algorithm, whereas OrthoMCL was the most sensitive. Evaluation of the algorithms across a varying number of species showed that cRBH had the highest accuracy and lowest false discovery rate [FP/(FP+TP)], followed by cRSD. Of the six species in our set, three descended from an ancestor that underwent whole genome duplication. Subsequent differential duplicate loss events in the three descendants resulted in distinct classes of gene loss patterns, including cases where the genes retained in the three descendants are paralogs, constituting 'traps' for ortholog prediction algorithms. We found that the false discovery rate of all algorithms dramatically increased in these traps. CONCLUSIONS These results suggest that simple algorithms, like cRBH, may be better ortholog predictors than more complex ones (e.g., OrthoMCL and MultiParanoid) for evolutionary and functional genomics studies where the objective is the accurate inference of single-copy orthologs (e.g., molecular phylogenetics), but that all algorithms fail to accurately predict orthologs when paralogy is rampant.
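The four evaluation measures defined in brackets in this abstract can be written out directly. The Python sketch below uses the formulas exactly as stated there. The `Confusion` class, the function names and the counts are illustrative assumptions and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of true/false positives and negatives (illustrative values)."""
    tp: int
    tn: int
    fp: int
    fn: int


def sensitivity(c):
    return c.tp / (c.tp + c.fn)            # TP/(TP+FN)


def specificity(c):
    return c.tn / (c.tn + c.fp)            # TN/(TN+FP)


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def false_discovery_rate(c):
    return c.fp / (c.fp + c.tp)            # FP/(FP+TP)


# Made-up counts, chosen only to exercise the formulas.
counts = Confusion(tp=80, tn=90, fp=10, fn=20)
print(sensitivity(counts), specificity(counts),
      accuracy(counts), false_discovery_rate(counts))
```

The sketch does not guard against zero denominators. A real evaluation would need to decide how to report a measure when, for example, an algorithm makes no positive predictions.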
Affiliation(s)
- Leonidas Salichos
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee, United States of America
- Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee, United States of America
|
99
|
Kerner P, Degnan SM, Marchand L, Degnan BM, Vervoort M. Evolution of RNA-binding proteins in animals: insights from genome-wide analysis in the sponge Amphimedon queenslandica. Mol Biol Evol 2011; 28:2289-303. [PMID: 21325094 DOI: 10.1093/molbev/msr046] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Indexed: 12/20/2022] Open
Abstract
RNA-binding proteins (RBPs) are key players in various biological processes, most notably regulation of gene expression at the posttranscriptional level. Although many RBPs have been carefully studied in model organisms, very few studies have addressed the evolution of these proteins at the scale of the animal kingdom. We identified a large set of putative RBPs encoded by the genome of the demosponge Amphimedon queenslandica, a species representing a basal animal lineage. We compared the Amphimedon RBPs with those encoded by the genomes of two bilaterians (human and Drosophila), representatives of two other basal metazoan lineages (a placozoan and a cnidarian), a choanoflagellate (probable sister group of animals), and two fungi. We established the evolutionary history of 32 families of RBPs and found that most of the diversity of RBPs present in contemporary metazoans, including humans, was already established in the last common ancestor (LCA) of animals. This includes RBPs known to be involved in key processes in bilaterians, such as development, stem and/or germ cells properties, and noncoding RNA pathways. From this analysis, we infer that a complex toolkit of RBPs was present in the LCA of animals and that it has been recruited to perform new functions during early animal evolution, in particular in relation to the acquisition of multicellularity.
Affiliation(s)
- Pierre Kerner
- Development and Neurobiology Programme, Institut Jacques Monod, Centre national de la recherche scientifique/Université Paris Diderot-Paris 7, Paris cedex, France
|
100
|
Kugler JE, Kerner P, Bouquet JM, Jiang D, Di Gregorio A. Evolutionary changes in the notochord genetic toolkit: a comparative analysis of notochord genes in the ascidian Ciona and the larvacean Oikopleura. BMC Evol Biol 2011; 11:21. [PMID: 21251251 PMCID: PMC3034685 DOI: 10.1186/1471-2148-11-21] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 08/16/2010] [Accepted: 01/20/2011] [Indexed: 11/12/2022] Open
Abstract
Background The notochord is a defining feature of the chordate clade, and invertebrate chordates, such as tunicates, are uniquely suited for studies of this structure. Here we used a well-characterized set of 50 notochord genes known to be targets of the notochord-specific Brachyury transcription factor in one tunicate, Ciona intestinalis (Class Ascidiacea), to begin determining whether the same genetic toolkit is employed to build the notochord in another tunicate, Oikopleura dioica (Class Larvacea). We identified Oikopleura orthologs of the Ciona notochord genes, as well as lineage-specific duplicates for which we determined the phylogenetic relationships with related genes from other chordates, and we analyzed their expression patterns in Oikopleura embryos. Results Of the 50 Ciona notochord genes that were used as a reference, only 26 had clearly identifiable orthologs in Oikopleura. Two of these conserved genes appeared to have undergone Oikopleura- and/or tunicate-specific duplications, and one was present in three copies in Oikopleura, thus bringing the number of genes to test to 30. We were able to clone and test 28 of these genes. Thirteen of the 28 Oikopleura orthologs of Ciona notochord genes showed clear expression in all or in part of the Oikopleura notochord, seven were diffusely expressed throughout the tail, six were expressed in tissues other than the notochord, while two probes did not provide a detectable signal at any of the stages analyzed. One of the notochord genes identified, Oikopleura netrin, was found to be unevenly expressed in notochord cells, in a pattern reminiscent of that previously observed for one of the Oikopleura Hox genes. Conclusions A surprisingly high number of Ciona notochord genes do not have apparent counterparts in Oikopleura, and only a fraction of the evolutionarily conserved genes show clear notochord expression. 
This suggests that Ciona and Oikopleura, despite the morphological similarities of their notochords, have developed rather divergent sets of notochord genes after their split from a common tunicate ancestor. This study demonstrates that comparisons between divergent tunicates can lead to insights into the basic complement of genes sufficient for notochord development, and elucidate the constraints that control its composition.
Affiliation(s)
- Jamie E Kugler
- Department of Cell and Developmental Biology, Weill Medical College of Cornell University, New York, NY 10065, USA
|