1
Dosch J, Bergmann H, Tran V, Ebersberger I. FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinformatics 2023; 39:btad226. [PMID: 37084276 PMCID: PMC10185405 DOI: 10.1093/bioinformatics/btad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 02/23/2023] [Accepted: 04/13/2023] [Indexed: 04/23/2023] [Open Access]
Abstract
MOTIVATION: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations.
RESULTS: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications.
AVAILABILITY AND IMPLEMENTATION: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.
Affiliation(s)
- Julian Dosch
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Holger Bergmann
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Vinh Tran
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Ingo Ebersberger
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, 60325, Germany
  - LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt, 60325, Germany
2
Moussa S, Kilgour M, Jans C, Hernandez-Garcia A, Cuperlovic-Culf M, Bengio Y, Simine L. Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning. J Phys Chem B 2023; 127:62-68. [PMID: 36574492 DOI: 10.1021/acs.jpcb.2c05660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2022]
Abstract
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria. Relevant criteria may be, for example, the presence of specific folding motifs, binding to molecular ligands, sensing properties, and so on. Most practical approaches to aptamer design identify a small set of promising candidate sequences using high-throughput experiments (e.g., SELEX) and then optimize performance by introducing only minor modifications to the empirically found candidates. Sequences that possess the desired properties but differ drastically in chemical composition will add diversity to the search space and facilitate the discovery of useful nucleic acid aptamers. Systematic diversification protocols are needed. Here we propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity. We start by training a Potts model using the maximum entropy principle on a small set of empirically identified sequences unified by a common feature. To generate new candidate sequences with a controllable degree of diversity, we take advantage of the model's spectral feature: an "energy" bandgap separating sequences that are similar to the training set from those that are distinct. By controlling the Potts energy range that is sampled, we generate sequences that are distinct from the training set yet still likely to have the encoded features. To demonstrate performance, we apply our approach to design diverse pools of sequences with specified secondary structure motifs in 30-mer RNA and DNA aptamers.
Affiliation(s)
- Siba Moussa
  - Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, Quebec H3A 0B8, Canada
- Michael Kilgour
  - Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, Quebec H3A 0B8, Canada
- Clara Jans
  - Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, Quebec H3A 0B8, Canada
- Alex Hernandez-Garcia
  - Montreal Institute for Learning Algorithms, 6666 St. Urbain, #200, Montreal, Quebec H2S 3H1, Canada
- Miroslava Cuperlovic-Culf
  - Digital Technologies Research Centre, National Research Council of Canada, 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada
- Yoshua Bengio
  - Montreal Institute for Learning Algorithms, 6666 St. Urbain, #200, Montreal, Quebec H2S 3H1, Canada
- Lena Simine
  - Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, Quebec H3A 0B8, Canada
3
Sepúlveda V, Maurelia F, González M, Aguayo J, Caprile T. SCO-spondin, a giant matricellular protein that regulates cerebrospinal fluid activity. Fluids Barriers CNS 2021; 18:45. [PMID: 34600566 PMCID: PMC8487547 DOI: 10.1186/s12987-021-00277-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/23/2021] [Accepted: 09/11/2021] [Indexed: 12/28/2022] [Open Access]
Abstract
Cerebrospinal fluid is a clear fluid that occupies the ventricular and subarachnoid spaces within and around the brain and spinal cord. Cerebrospinal fluid is a dynamic signaling milieu that transports nutrients, waste materials and neuroactive substances that are crucial for the development, homeostasis and functionality of the central nervous system. The mechanisms that enable cerebrospinal fluid to simultaneously exert these homeostatic/dynamic functions are not fully understood. SCO-spondin is a large glycoprotein secreted into the cerebrospinal fluid from the early stages of development. Its domain architecture resembles a combination of a matricellular protein and the ligand-binding region of the LDL receptor family. The matricellular proteins are a group of extracellular proteins with the capacity to interact with different molecules, such as growth factors, cytokines and cellular receptors, enabling the integration of information to modulate various physiological and pathological processes. In the same way, the LDL receptor family interacts with many ligands, including β-amyloid peptide and different growth factors. The domain similarity suggests that SCO-spondin is a matricellular protein able to bind, modulate, and transport different cerebrospinal fluid molecules. SCO-spondin can be found soluble or polymerized into a dynamic threadlike structure called the Reissner fiber, which extends from the diencephalon to the caudal tip of the spinal cord. The Reissner fiber continuously moves caudally as new SCO-spondin molecules are added at the cephalic end and are disaggregated at the caudal end. This movement, like a conveyor belt, allows the transport of the bound molecules, thereby increasing their lifespan and action radius. The binding of SCO-spondin to some relevant molecules has already been reported; however, in this review we suggest more than 30 possible binding partners, including β-amyloid peptide and several growth factors. This new perspective characterizes SCO-spondin as a regulator of cerebrospinal fluid activity, explaining its high evolutionary conservation, its apparent multifunctionality, and the lethality or severe malformations, such as hydrocephalus and curved body axis, of knockout embryos. Understanding the regulation and identifying binding partners of SCO-spondin are crucial for better comprehension of cerebrospinal fluid physiology.
Affiliation(s)
- Vania Sepúlveda
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Felipe Maurelia
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Maryori González
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Jaime Aguayo
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Teresa Caprile
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
4
Strepis N, Naranjo HD, Meier-Kolthoff J, Göker M, Shapiro N, Kyrpides N, Klenk HP, Schaap PJ, Stams AJM, Sousa DZ. Genome-guided analysis allows the identification of novel physiological traits in Trichococcus species. BMC Genomics 2020; 21:24. [PMID: 31914924 PMCID: PMC6950789 DOI: 10.1186/s12864-019-6410-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/20/2018] [Accepted: 12/18/2019] [Indexed: 11/29/2022] [Open Access]
Abstract
BACKGROUND: The genus Trichococcus currently contains nine species: T. flocculiformis, T. pasteurii, T. palustris, T. collinsii, T. patagoniensis, T. ilyis, T. paludicola, T. alkaliphilus, and T. shcherbakoviae. In general, Trichococcus species can degrade a wide range of carbohydrates. However, only T. pasteurii and a non-characterized strain of Trichococcus, strain ES5, can convert glycerol to mainly 1,3-propanediol. Comparative genomic analysis of Trichococcus species provides the opportunity to further explore the physiological potential and uncover novel properties of this genus.
RESULTS: In this study, a genotype-phenotype comparative analysis of Trichococcus strains was performed. The genome of Trichococcus strain ES5 was sequenced and included in the comparison with the other nine type strains. Genes encoding functions related to, e.g., the utilization of different carbon sources (glycerol, arabinan and alginate), antibiotic resistance, tolerance to low temperature and osmoregulation could be identified in all the sequences analysed. T. pasteurii and Trichococcus strain ES5 contain an operon with genes encoding the enzymes necessary for 1,3-PDO production from glycerol. All the analysed genomes comprise genes encoding cold shock domains, but only five of the Trichococcus species can grow at 0 °C. Protein domains associated with osmoregulation mechanisms are encoded in the genomes of all Trichococcus species, except in T. palustris, which had a lower resistance to salinity than the other nine studied Trichococcus strains.
CONCLUSIONS: Genome analysis and comparison of ten Trichococcus strains allowed the identification of physiological traits related to substrate utilization and environmental stress resistance (e.g. to cold and salinity). Some substrates were used by single species, e.g. alginate by T. collinsii and arabinan by T. alkaliphilus. Strain ES5 may represent a subspecies of Trichococcus flocculiformis and, contrary to the type strain (DSM 2094T), is able to grow on glycerol with the production of 1,3-propanediol.
Affiliation(s)
- Nikolaos Strepis
  - Laboratory of Microbiology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
- Henry D. Naranjo
  - Laboratory of Microbiology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
- Jan Meier-Kolthoff
  - Leibniz Institute DSMZ German Collection of Microorganisms and Cell Cultures, Inhoffenstraße 7B, 38124 Braunschweig, Germany
- Markus Göker
  - Leibniz Institute DSMZ German Collection of Microorganisms and Cell Cultures, Inhoffenstraße 7B, 38124 Braunschweig, Germany
- Nicole Shapiro
  - DOE Joint Genome Institute, 2800 Mitchell Drive 100, CA, Walnut Creek, CA 94598 USA
- Nikos Kyrpides
  - DOE Joint Genome Institute, 2800 Mitchell Drive 100, CA, Walnut Creek, CA 94598 USA
- Hans-Peter Klenk
  - Leibniz Institute DSMZ German Collection of Microorganisms and Cell Cultures, Inhoffenstraße 7B, 38124 Braunschweig, Germany
  - School of Biology, Newcastle University, Ridley Building 2, Newcastle, NE1 7RU UK
- Peter J. Schaap
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
- Alfons J. M. Stams
  - Laboratory of Microbiology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
  - Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Diana Z. Sousa
  - Laboratory of Microbiology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands
5
Hernandez-Guerrero R, Galán-Vásquez E, Pérez-Rueda E. The protein architecture in Bacteria and Archaea identifies a set of promiscuous and ancient domains. PLoS One 2019; 14:e0226604. [PMID: 31856202 PMCID: PMC6922389 DOI: 10.1371/journal.pone.0226604] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/02/2019] [Accepted: 11/29/2019] [Indexed: 11/19/2022] [Open Access]
Abstract
In this work, we describe a systematic comparative genomic analysis of promiscuous domains in genomes of Bacteria and Archaea. A quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was applied to 1317 domains in 1320 genomes of Bacteria and Archaea. A functional analysis associated with the WDAS per genome showed that 18 of 50 functional categories were significantly enriched in the promiscuous domains; in particular, small-molecule binding domains, transferase domains, DNA binding domains (transcription factors), and signal transduction domains were identified as promiscuous. In contrast, non-promiscuous domains were associated with 6 of 50 functional categories, and the category Function unknown was enriched. In addition, the WDASs of 52 domains correlated with genome size, i.e., WDAS values decreased as the genome size increased, suggesting that the number of combinations at larger domains increases, including domains in the superfamilies Winged helix-turn-helix and P-loop-containing nucleoside triphosphate hydrolases. Finally, based on classification of the domains according to their ancestry, we determined that the set of 52 promiscuous domains is also ancient and abundant among all the genomes, in contrast to the non-promiscuous domains. In summary, we consider that the association between these two classes of protein domains (promiscuous and non-promiscuous) provides bacterial and archaeal cells with the ability to respond to diverse environmental challenges.
Affiliation(s)
- Rafael Hernandez-Guerrero
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
- Edgardo Galán-Vásquez
  - Departamento de Ingeniería de Sistemas Computacionales y Automatización, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Ciudad Universitaria, Universidad Nacional Autónoma de México, Ciudad de México, México
- Ernesto Pérez-Rueda
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
  - Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago, Chile
6
Abstract
This chapter reviews current research on how protein domain architectures evolve. We begin by summarizing work on the phylogenetic distribution of proteins, as this will directly impact which domain architectures can be formed in different species. Studies relating domain family size to occurrence have shown that they generally follow power law distributions, both within genomes and larger evolutionary groups. These findings were subsequently extended to multi-domain architectures. Genome evolution models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. Next, we study the principles of protein domain architecture evolution and how these have been inferred from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly). We end with a discussion of some available tools for computational analysis or exploitation of protein domain architectures and their evolution.
7
Perez-Rueda E, Hernandez-Guerrero R, Martinez-Nuñez MA, Armenta-Medina D, Sanchez I, Ibarra JA. Abundance, diversity and domain architecture variability in prokaryotic DNA-binding transcription factors. PLoS One 2018; 13:e0195332. [PMID: 29614096 PMCID: PMC5882156 DOI: 10.1371/journal.pone.0195332] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 11/22/2017] [Accepted: 03/20/2018] [Indexed: 02/04/2023] [Open Access]
Abstract
Gene regulation at the transcriptional level is a central process in all organisms, and DNA-binding transcription factors, known as TFs, play a fundamental role. This class of proteins usually binds at specific DNA sequences, activating or repressing gene expression. In general, TFs are composed of two domains: the DNA-binding domain (DBD) and an extra domain, which in this work we have named “companion domain” (CD). The latter could be involved in one or more functions such as ligand binding, protein-protein interactions or even enzymatic activity. In contrast to DBDs, which have been widely characterized both experimentally and bioinformatically, information on the abundance, distribution, variability and possible role of the CDs is scarce. Here, we investigated these issues associated with the domain architectures of TFs in prokaryotic genomes. To this end, 19 families of TFs in 761 non-redundant bacterial and archaeal genomes were evaluated. We found four main groups based on the abundance and distribution in the analyzed genomes: i) LysR and TetR/AcrR; ii) AraC/XylS, SinR, and others; iii) Lrp, Fis, ArsR, and others; and iv) a group that included only two families, ArgR and BirA. Based on a classification of the organisms according to their life-styles, a greater abundance of regulatory families was identified in free-living organisms than in pathogenic, extremophilic or intracellular organisms. Finally, the protein architecture diversity associated with the 19 families, considering a weight score for domain promiscuity, showed which regulatory families were characterized by either a large diversity of CDs, here named “promiscuous” families given the elevated number of variable domains found in those TFs, or a low diversity of CDs. Altogether, this information helped us to understand the diversity and distribution of the 19 prokaryotic TF families. Moreover, initial steps were taken to comprehend the variability of the extra domain in those TFs, which eventually might assist in evolutionary and functional studies.
Affiliation(s)
- Ernesto Perez-Rueda
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
  - Departamento de Ingenieria Celular y Biocatálisis, Instituto de Biotecnología, UNAM, Cuernavaca, Morelos, México
- Rafael Hernandez-Guerrero
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
- Mario Alberto Martinez-Nuñez
  - Laboratorio de Ecogenómica, Unidad Académica de Ciencias y Tecnología de Yucatán, Facultad de Ciencias, UNAM, Mérida, Yucatán, México
- Israel Sanchez
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
- J. Antonio Ibarra
  - Laboratorio de Genética Microbiana, Departamento de Microbiología, Escuela Nacional de Ciencias Biológicas, Instituto Politécnico Nacional, Ciudad de México, México
8
Mata AR, Pacheco CM, Cruz Pérez JF, Sáenz MM, Baca BE. In silico comparative analysis of GGDEF and EAL domain signaling proteins from the Azospirillum genomes. BMC Microbiol 2018; 18:20. [PMID: 29523074 PMCID: PMC5845226 DOI: 10.1186/s12866-018-1157-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 09/04/2017] [Accepted: 02/09/2018] [Indexed: 01/13/2023] [Open Access]
Abstract
BACKGROUND: The cyclic-di-GMP (c-di-GMP) second messenger exemplifies a signaling system that regulates many bacterial behaviors of key importance; among them, c-di-GMP controls the transition between motile and sessile life-styles in bacteria. Cellular c-di-GMP levels in bacteria are regulated by the opposing enzymatic activities of diguanylate cyclases and phosphodiesterases, which are proteins that have GGDEF and EAL domains, respectively. Azospirillum is a genus of plant-growth-promoting bacteria, and members of this genus have beneficial effects in many agronomically and ecologically essential plants. These bacteria also inhabit aquatic ecosystems, and have been isolated from humus-reducing habitats. Bioinformatic and structural approaches were used to identify genes predicted to encode GG[D/E]EF, EAL and GG[D/E]EF-EAL domain proteins from nine genome sequences.
RESULTS: The analyzed sequences revealed that the genomes of A. humicireducens SgZ-5T, A. lipoferum 4B, Azospirillum sp. B510, A. thiophilum BV-ST, A. halopraeferens DSM3675, A. oryzae A2P, and A. brasilense Sp7, Sp245 and Az39 encode 29 to 41 of these predicted proteins. Notably, only 15 proteins were conserved in all nine genomes: eight GGDEF, three EAL and four GGDEF-EAL hybrid domain proteins, all of which corresponded to core genes in the genomes. The predicted proteins exhibited variable lengths, architectures and sensor domains. In addition, the predicted cellular localizations showed that some of the proteins contain transmembrane domains, suggesting that these proteins are anchored to the membrane. Therefore, as reported in other soil bacteria, the Azospirillum genomes encode a large number of proteins that are likely involved in c-di-GMP metabolism. In addition, the data obtained here strongly suggest host specificity and environment-specific adaptation.
CONCLUSIONS: Bacteria of the Azospirillum genus cope with diverse environmental conditions to survive in soil and aquatic habitats and, in certain cases, to colonize and benefit their host plant. Gaining information on the structures of proteins involved in c-di-GMP metabolism in Azospirillum appears to be an important step in determining the c-di-GMP signaling pathways involved in the transition of a motile cell towards a biofilm life-style, as an example of microbial genome plasticity under diverse in situ environments.
Affiliation(s)
- Alberto Ramírez Mata
  - Centro de Investigaciones en Ciencias Microbiológicas, Benemérita Universidad Autónoma de Puebla. Edif. IC11, Ciudad Universitaria, Col. San Manuel Puebla Pue, CP72570 Puebla, Mexico
- César Millán Pacheco
  - Facultad de Farmacia, Universidad Autónoma del Estado de Morelos, Av. Universidad #1001, Col. Chamilpa, C.P, 62209 Cuernavaca, Morelos Mexico
- José F. Cruz Pérez
  - Centro de Investigaciones en Ciencias Microbiológicas, Benemérita Universidad Autónoma de Puebla. Edif. IC11, Ciudad Universitaria, Col. San Manuel Puebla Pue, CP72570 Puebla, Mexico
- Martha Minjárez Sáenz
  - Centro de Investigaciones en Ciencias Microbiológicas, Benemérita Universidad Autónoma de Puebla. Edif. IC11, Ciudad Universitaria, Col. San Manuel Puebla Pue, CP72570 Puebla, Mexico
- Beatriz E. Baca
  - Centro de Investigaciones en Ciencias Microbiológicas, Benemérita Universidad Autónoma de Puebla. Edif. IC11, Ciudad Universitaria, Col. San Manuel Puebla Pue, CP72570 Puebla, Mexico
9
Koehorst JJ, Saccenti E, Schaap PJ, Martins Dos Santos VAP, Suarez-Diez M. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics. F1000Res 2016; 5:1987. [PMID: 27703668 PMCID: PMC5031134 DOI: 10.12688/f1000research.9416.3] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Accepted: 06/26/2017] [Indexed: 11/20/2022] [Open Access]
Abstract
A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need to define arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity for identifying groups of functionally equivalent proteins within and across taxonomic boundaries, and that the approach is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
Affiliation(s)
- Jasper J Koehorst
  - Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
- Edoardo Saccenti
  - Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
- Peter J Schaap
  - Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
- Vitor A P Martins Dos Santos
  - Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
  - LifeGlimmer GmbH, Berlin, Germany
- Maria Suarez-Diez
  - Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
10
Doğan T, MacDougall A, Saidi R, Poggioli D, Bateman A, O'Donovan C, Martin MJ. UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics 2016; 32:2264-71. [PMID: 27153729 PMCID: PMC4965628 DOI: 10.1093/bioinformatics/btw114] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 08/29/2015] [Revised: 01/22/2016] [Accepted: 02/25/2016] [Indexed: 11/17/2022] [Open Access]
Abstract
MOTIVATION: Similarity-based methods have been widely used to infer the properties of genes and gene products with little or no experimental annotation. New approaches that overcome the limitations of methods that rely solely upon sequence similarity are attracting increased attention. One of these novel approaches is to use the organization of the structural domains in proteins.
RESULTS: We propose a method for the automatic annotation of protein sequences in the UniProt Knowledgebase (UniProtKB) by comparing their domain architectures, classifying proteins based on the similarities and propagating functional annotation. The performance of this method was measured through a cross-validation analysis using the Gene Ontology (GO) annotation of a sub-set of UniProtKB/Swiss-Prot. The results demonstrate the effectiveness of this approach in detecting functional similarity, with an average F-score of 0.85. Applying the method to nearly 55.3 million uncharacterized proteins in UniProtKB/TrEMBL resulted in 44 818 178 GO term predictions for 12 172 114 proteins. 22% of these predictions were for 2 812 016 previously non-annotated protein entries, indicating the significance of the value added by this approach.
AVAILABILITY AND IMPLEMENTATION: The results of the method are available at: ftp://ftp.ebi.ac.uk/pub/contrib/martin/DAAC/
CONTACT: tdogan@ebi.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tunca Doğan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Alistair MacDougall
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Rabie Saidi
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Diego Poggioli
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Claire O'Donovan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Maria J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
11
Figueiredo HCP, Soares SC, Pereira FL, Dorella FA, Carvalho AF, Teixeira JP, Azevedo VAC, Leal CAG. Comparative genome analysis of Weissella ceti, an emerging pathogen of farm-raised rainbow trout. BMC Genomics 2015; 16:1095. [PMID: 26694728 PMCID: PMC4687380 DOI: 10.1186/s12864-015-2324-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 06/25/2015] [Accepted: 12/15/2015] [Indexed: 11/10/2022] [Open Access]
Abstract
Background The genus Weissella belongs to the lactic acid bacteria and includes 18 currently identified species, predominantly isolated from fermented food but rarely from cases of bacteremia in animals. Recently, a new species, designated Weissella ceti, has been correlated with hemorrhagic illness in farm-raised rainbow trout in China, Brazil, and the USA, with high transmission and mortality rates during outbreaks. Although W. ceti is an important emerging veterinary pathogen, little is known about its genomic features or virulence mechanisms. To better understand these and to characterize the species, we have previously sequenced the genomes of W. ceti strains WS08, WS74, and WS105, isolated from different rainbow trout farms in Brazil and displaying different pulsed-field gel electrophoresis patterns. Here, we present a comparative analysis of the three previously sequenced genomes of W. ceti strains from Brazil along with W. ceti NC36 from the USA and those of other Weissella species. Results Phylogenomic and orthology-based analyses both showed high similarity in the genetic structure of these W. ceti strains. This structure is corroborated by the highly syntenic order of their genes and the neutral evolution inferred from Tajima’s D. A whole-genome multilocus sequence typing analysis distinguished strains WS08 and NC36 from strains WS74 and WS105. We predicted 10 putative genomic islands (GEI), among which PAIs 3a and 3b are phage sequences that occur only in WS105 and WS74, respectively, whereas PAI 1 is species specific. Conclusions We identified several genes putatively involved in the basic processes of bacterial physiology and pathogenesis, including survival in the aquatic environment, adherence in the host, spread inside the host, resistance to immune-system-mediated stresses, and antibiotic resistance. These data provide new insights into the molecular epidemiology and host adaptation of this emerging pathogen in aquaculture.
Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2324-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Henrique C P Figueiredo
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil. .,Veterinary School, Department of Preventive Veterinary Medicine, Federal University of Minas Gerais, Av. Antônio Carlos 6627, Pampulha, Belo Horizonte, 30161-970, MG, Brazil.
| | - Siomar C Soares
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Felipe L Pereira
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Fernanda A Dorella
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Alex F Carvalho
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Júnia P Teixeira
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Vasco A C Azevedo
- Laboratory of Cellular and Molecular Genetics, Institute for Biological Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
| | - Carlos A G Leal
- AQUACEN, National Reference Laboratory for Aquatic Animal Diseases, Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil.
|
12
|
Assessing the Metabolic Diversity of Streptococcus from a Protein Domain Point of View. PLoS One 2015; 10:e0137908. [PMID: 26366735 PMCID: PMC4569324 DOI: 10.1371/journal.pone.0137908] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/30/2015] [Accepted: 08/22/2015] [Indexed: 01/17/2023] Open
Abstract
Understanding the diversity and robustness of the metabolism of bacteria is fundamental for understanding how bacteria evolve and adapt to different environments. In this study, we characterised 121 Streptococcus strains and studied metabolic diversity from a protein domain perspective. Metabolic pathways were described in terms of the promiscuity of domains participating in metabolic pathways that were inferred to be functional. Promiscuity was defined by adapting existing measures based on domain abundance and versatility. The approach proved to be successful in capturing bacterial metabolic flexibility and species diversity, indicating that it can be described in terms of reuse and sharing functional domains in different proteins involved in metabolic activity. Additionally, we showed striking differences among metabolic organisation of the pathogenic serotype 2 Streptococcus suis and other strains.
|
13
|
Analysis of the protein domain and domain architecture content in fungi and its application in the search of new antifungal targets. PLoS Comput Biol 2014; 10:e1003733. [PMID: 25033262 PMCID: PMC4102429 DOI: 10.1371/journal.pcbi.1003733] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 12/11/2013] [Accepted: 06/04/2014] [Indexed: 01/25/2023] Open
Abstract
Over the past several years fungal infections have shown an increasing incidence in the susceptible population, and caused high mortality rates. In parallel, multi-resistant fungi are emerging in human infections. Therefore, the identification of new potential antifungal targets is a priority. The first task of this study was to analyse the protein domain and domain architecture content of the 137 fungal proteomes (corresponding to 111 species) available in UniProtKB (UniProt KnowledgeBase) by January 2013. The resulting list of core and exclusive domain and domain architectures is provided in this paper. It delineates the different levels of fungal taxonomic classification: phylum, subphylum, order, genus and species. The analysis highlighted Aspergillus as the most diverse genus in terms of exclusive domain content. In addition, we also investigated which domains could be considered promiscuous in the different organisms. As an application of this analysis, we explored three different ways to detect potential targets for antifungal drugs. First, we compared the domain and domain architecture content of the human and fungal proteomes, and identified those domains and domain architectures only present in fungi. Secondly, we looked for information regarding fungal pathways in public repositories, where proteins containing promiscuous domains could be involved. Three pathways were identified as a result: lovastatin biosynthesis, xylan degradation and biosynthesis of siroheme. Finally, we classified a subset of the studied fungi in five groups depending on their occurrence in clinical samples. We then looked for exclusive domains in the groups that were more relevant clinically and determined which of them had the potential to bind small molecules. Overall, this study provides a comprehensive analysis of the available fungal proteomes and shows three approaches that can be used as a first step in the detection of new antifungal targets. 
Some fungi have become pathogenic to plants and, to a lesser extent, to animals. Under certain conditions their presence in the human body can pose a threat to human health, especially for immunocompromised patients. Yet, some fungi can also infect healthy individuals. The low sensitivity of the antifungal drugs available, together with the clinically observed resistance of some fungi, raises the demand for new alternative treatments. Proteins are biological molecules which perform essential functions within living organisms. Many of those functions are attributed to the varying folded structure of each protein. These configurations are composed of functional units, also called domains, each independently responsible for a fraction of the overall biological function. Understanding how the different block combinations are distributed across members of the same or similar families of organisms is important. For instance, exclusive domain combinations can hold particular acquired functions. Blocks displaying a high mobility can play major roles for the organism's survival. The biological goal of this study was to analyse the functional implications of protein domains and domain combinations in the available fungal proteomes. This information can be used to highlight proteins and pathways that could be potentially used as drug targets.
|
14
|
Terrapon N, Weiner J, Grath S, Moore AD, Bornberg-Bauer E. Rapid similarity search of proteins using alignments of domain arrangements. Bioinformatics 2013; 30:274-81. [PMID: 23828785 DOI: 10.1093/bioinformatics/btt379] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 01/25/2023]
Abstract
MOTIVATION Homology search methods are dominated by the central paradigm that sequence similarity is a proxy for common ancestry and, by extension, functional similarity. For determining sequence similarity in proteins, the most widely used methods use models of sequence evolution and compare amino-acid strings in search of conserved linear stretches. Probabilistic models or sequence profiles capture the position-specific variation in an alignment of homologous sequences and can identify conserved motifs or domains. While profile-based search methods are generally more accurate than simple sequence comparison methods, they tend to be computationally more demanding. In recent years, several methods have emerged that perform protein similarity searches based on domain composition. However, few methods have considered the linear arrangements of domains when conducting similarity searches, despite strong evidence that domain order can harbour considerable functional and evolutionary signal. RESULTS Here, we introduce an alignment scheme that uses a classical dynamic programming approach to the global alignment of domains. We illustrate that representing proteins as strings of domains (domain arrangements) and comparing these strings globally allows for a homology search that is both fast and sensitive. Further, we demonstrate that the presented methods complement existing methods by finding similar proteins missed by popular amino-acid-based comparison methods. AVAILABILITY An implementation of the presented algorithms, a web-based interface as well as a command-line program for batch searching against the UniProt database can be found at http://rads.uni-muenster.de. Furthermore, we provide a JAVA API for programmatic access to domain-string–based search methods.
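The "classical dynamic programming approach to the global alignment of domains" mentioned above can be illustrated with a Needleman-Wunsch alignment in which the alphabet is domain identifiers rather than amino acids. This is a minimal sketch, not the authors' implementation: the function name `align_domains` and the match, mismatch and gap values are placeholder assumptions, since the abstract does not describe the scoring scheme.

```python
def align_domains(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two domain arrangements (lists of domain IDs).

    Scoring values are illustrative placeholders, not those of the paper.
    Returns (score, aligned_a, aligned_b), with "-" marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], aligned_a[::-1], aligned_b[::-1]


# Hypothetical arrangements: the second protein lacks the N-terminal SH3 domain.
total, row_a, row_b = align_domains(["SH3", "SH2", "Pkinase"], ["SH2", "Pkinase"])
print(total, row_a, row_b)
```

Because each protein is reduced to a handful of domain symbols, the table is tiny compared with a residue-level alignment, which is consistent with the speed argument made in the abstract.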
Affiliation(s)
- Nicolas Terrapon
- Westfalian Wilhelms University, Institute of Evolution and Biodiversity, Huefferstr. 1, 48149 Muenster, Germany and Max Planck Institute for Infection Biology, Charitéplatz 1, 10117 Berlin, Germany
|
15
|
Syamaladevi DP, Joshi A, Sowdhamini R. An alignment-free domain architecture similarity search (ADASS) algorithm for inferring homology between multi-domain proteins. Bioinformation 2013; 9:491-9. [PMID: 23861564 PMCID: PMC3705623 DOI: 10.6026/97320630009491] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 11/19/2012] [Revised: 01/01/2013] [Accepted: 01/02/2013] [Indexed: 11/23/2022] Open
Abstract
Annotation of genes and their products is largely guided by inferred homology. Sequence similarity is the primary measure used for annotation; domain content and order have been given less importance, despite the fact that domain insertions, deletions and positional changes can introduce functional variety. Several recently developed methods quantify domain architecture similarity based on alignments of their sequences and focus only on homologous proteins. We present an alignment-free domain architecture similarity search (ADASS) algorithm that identifies proteins that share very poor sequence similarity yet have similar domain architectures. We introduce a “singlet matching-triplet comparison” method in ADASS, wherein a triplet of domains is compared with other triplets in a pair-wise comparison of two domain architectures. Different events in the triplet comparison are scored according to a scoring scheme, and an average pairwise distance score (Domain Architecture Distance score, DAD score) is calculated between protein domain architectures. We use the domain architectures of a selected domain, termed the centric domain, and cluster them based on the DAD score. The algorithm has a high Positive Prediction Value (PPV) with respect to the clustering of the sequences of selected domain architectures. A comparison of domain architecture based dendrograms using the ADASS method and an existing method revealed that ADASS can classify proteins depending on the extent of domain architecture level similarity. ADASS is more relevant in cases of proteins with tiny domains that contribute little to the overall sequence similarity but significantly to the overall function.
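The abstract does not give the ADASS scoring scheme, so the following is only a simplified stand-in for the general idea of comparing architectures through domain triplets without aligning sequences. It assumes a plain Jaccard distance over consecutive domain triplets; the function names and the boundary markers `^` and `$` are hypothetical, and the result is not the DAD score.

```python
def domain_triplets(architecture):
    """Return the set of consecutive domain triplets, padded at both ends.

    "^" and "$" are hypothetical start/end markers so that terminal
    domains also appear in a triplet.
    """
    padded = ["^"] + list(architecture) + ["$"]
    return {tuple(padded[i:i + 3]) for i in range(len(padded) - 2)}


def triplet_distance(arch_a, arch_b):
    """Jaccard distance between the triplet sets of two architectures.

    A simplified illustration only; ADASS scores individual events in the
    triplet comparison, which is not reproduced here.
    """
    ta, tb = domain_triplets(arch_a), domain_triplets(arch_b)
    union = ta | tb
    if not union:
        return 0.0
    return 1 - len(ta & tb) / len(union)


# Hypothetical architectures sharing the C-terminal SH2-Pkinase arrangement.
dist = triplet_distance(["SH3", "SH2", "Pkinase"], ["SH2", "Pkinase"])
print(dist)
```

In this example the two architectures share one of four distinct triplets, giving a distance of 0.75. Because only domain order is used, a short domain counts as much as a long one, which matches the abstract's point about tiny domains.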
Affiliation(s)
- Divya P Syamaladevi
- Sugarcane Breeding Institute, Indian Council of Agricultural Research, Coimbatore, India, PIN 641 007; National Center for Biological Sciences (TIFR), UAS-GKVK Campus, Bellary Road, Bangalore 560 065, India
|
16
|
Wang JJY, Bensmail H, Gao X. Multiple graph regularized protein domain ranking. BMC Bioinformatics 2012; 13:307. [PMID: 23157331 PMCID: PMC3583823 DOI: 10.1186/1471-2105-13-307] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Received: 05/22/2012] [Accepted: 10/29/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the protein domain ranking methods. RESULTS To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned with ranking scores jointly and automatically, by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank achieves a better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods. CONCLUSION The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This aspect of generalization introduces a new frontier in applying multiple graphs to solving protein domain ranking applications.
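To make the idea of ranking over a combination of graphs concrete, here is a minimal sketch of graph-regularized ranking by score propagation on a weighted sum of affinity graphs. It is not MultiG-Rank itself: the paper learns the graph weights jointly with the ranking scores, whereas this sketch assumes fixed weights, and the function names, the `alpha` value and the toy graphs are illustrative assumptions.

```python
def combine(graphs, weights):
    """Weighted sum of affinity matrices (weights are fixed here, not learned)."""
    n = len(graphs[0])
    return [
        [sum(mu * g[i][j] for mu, g in zip(weights, graphs)) for j in range(n)]
        for i in range(n)
    ]


def normalize(w):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2)."""
    deg = [sum(row) for row in w]
    n = len(w)
    return [
        [w[i][j] / (deg[i] * deg[j]) ** 0.5 if deg[i] and deg[j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def rank(w, query, alpha=0.9, iters=200):
    """Propagate the query label over the graph: f <- alpha*S*f + (1-alpha)*y."""
    s = normalize(w)
    n = len(w)
    y = [1.0 if i == query else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


# Two hypothetical graphs over four domains: g1 links 0-1 and 2-3, g2 links 1-2.
g1 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
g2 = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

scores = rank(combine([g1, g2], [0.5, 0.5]), query=0)  # both graphs
single = rank(combine([g1, g2], [1.0, 0.0]), query=0)  # g1 only
print(scores)
print(single)
```

With both graphs combined, the four domains form a chain and the scores fall off with distance from the query. With g1 alone, domains 2 and 3 are unreachable from the query and keep a score of zero, which illustrates why the choice of graph matters for this family of methods.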
Affiliation(s)
- Jim Jing-Yan Wang
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.
|
17
|
Wang J, Gao X, Wang Q, Li Y. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics 2012; 13 Suppl 7:S2. [PMID: 22594999 PMCID: PMC3348016 DOI: 10.1186/1471-2105-13-s7-s2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. This kind of pairwise measure suffers from the limitation of neglecting the distribution of other proteins and thus cannot satisfy the need for high accuracy of the retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning the contextual dissimilarity/similarity measures can improve the retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner, which does not utilize the information of the known class labels of proteins in the database. RESULTS In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is, for a pair of proteins (i, j), if their contexts N(i) and N(j) are similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij by a factor learned from the contexts N(i) and N(j). Moreover, we divide the context into hierarchical sub-contexts and get the contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select the relevant (a pair of proteins that has the same class labels) and irrelevant (with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor.
Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context Coherently in an iterative algorithm, ProDis-ContSHC. We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information. CONCLUSIONS Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.
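The basic idea stated in the abstract, shrinking dij when the contexts N(i) and N(j) are similar, can be sketched in an unsupervised form. This is an assumption-laden illustration, not ProDis-ContSHC: the paper learns the regularizing factor with an SVM over hierarchical sub-contexts, whereas here the context is simply the k nearest neighbours and the factor is a hypothetical `1 - strength * overlap`, with overlap measured as the Jaccard index of the two contexts.

```python
def context(d, i, k):
    """Indices of the k nearest neighbours of protein i under d (excluding i)."""
    others = sorted((j for j in range(len(d)) if j != i), key=lambda j: d[i][j])
    return set(others[:k])


def contextual_dissimilarity(d, k=2, strength=0.5):
    """Rescale each d[i][j] by a factor that shrinks as the contexts overlap.

    The factor 1 - strength * overlap is a hypothetical choice for
    illustration; it is not the learned factor from the paper.
    """
    n = len(d)
    ctx = [context(d, i, k) for i in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            union = ctx[i] | ctx[j]
            overlap = len(ctx[i] & ctx[j]) / len(union) if union else 0.0
            out[i][j] = d[i][j] * (1 - strength * overlap)
    return out


# Hypothetical dissimilarities: proteins 0-2 form one group, 3-5 another.
d = [
    [0.0, 0.2, 0.3, 0.9, 0.9, 0.9],
    [0.2, 0.0, 0.25, 0.9, 0.9, 0.9],
    [0.3, 0.25, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.2, 0.3],
    [0.9, 0.9, 0.9, 0.2, 0.0, 0.25],
    [0.9, 0.9, 0.9, 0.3, 0.25, 0.0],
]
new_d = contextual_dissimilarity(d)
print(new_d[0][1], new_d[0][3])
```

In this toy input, proteins within a group share neighbours, so their dissimilarities shrink, while pairs from different groups have disjoint contexts and keep their original values.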
Affiliation(s)
- Jingyan Wang
- King Abdullah University of Science and Technology (KAUST), Mathematical and Computer Sciences and Engineering Division, Thuwal, 23955-6900, Saudi Arabia
|
18
|
Cohen-Gihon I, Fong JH, Sharan R, Nussinov R, Przytycka TM, Panchenko AR. Evolution of domain promiscuity in eukaryotic genomes--a perspective from the inferred ancestral domain architectures. Molecular BioSystems 2011; 7:784-92. [PMID: 21127809 PMCID: PMC3321261 DOI: 10.1039/c0mb00182a] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/21/2022]
Abstract
Most eukaryotic proteins are composed of two or more domains. These assemble in a modular manner to create new proteins usually by the acquisition of one or more domains to an existing protein. Promiscuous domains which are found embedded in a variety of proteins and co-exist with many other domains are of particular interest and were shown to have roles in signaling pathways and mediating network communication. The evolution of domain promiscuity is still an open problem, mostly due to the lack of sequenced ancestral genomes. Here we use inferred domain architectures of ancestral genomes to trace the evolution of domain promiscuity in eukaryotic genomes. We find an increase in average promiscuity along many branches of the eukaryotic tree. Moreover, domain promiscuity can proceed at almost a steady rate over long evolutionary time or exhibit lineage-specific acceleration. We also observe that many signaling and regulatory domains gained domain promiscuity around the Bilateria divergence. In addition we show that those domains that played a role in the creation of two body axes and existed before the divergence of the bilaterians from fungi/metazoan achieve a boost in their promiscuities during the bilaterian evolution.
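The abstract does not define the promiscuity measure it uses. As a minimal sketch, assuming the simplest common proxy (the number of distinct domains found directly adjacent to a given domain across a set of architectures), promiscuity could be tallied as follows. The function name and the example architectures are illustrative assumptions; published measures typically add weighting or normalization for domain abundance.

```python
from collections import defaultdict


def neighbour_counts(architectures):
    """Count distinct adjacent partner domains for every domain.

    A simple proxy for promiscuity; not the measure used in the paper.
    """
    neighbours = defaultdict(set)
    for arch in architectures:
        for dom in arch:
            neighbours[dom]  # register domains that never have a partner
        for left, right in zip(arch, arch[1:]):
            if left != right:  # ignore tandem repeats of the same domain
                neighbours[left].add(right)
                neighbours[right].add(left)
    return {dom: len(partners) for dom, partners in neighbours.items()}


# Hypothetical architectures, each a list of domains in N- to C-terminal order.
architectures = [
    ["SH3", "SH2", "Pkinase"],
    ["SH3", "PH"],
    ["SH3", "RhoGEF", "PH"],
    ["Pkinase"],
]
counts = neighbour_counts(architectures)
print(sorted(counts.items(), key=lambda item: -item[1]))
```

Here SH3 has three distinct partners and comes out as the most promiscuous domain, while Pkinase has one. Tracing such counts across inferred ancestral architectures is the kind of comparison the abstract describes.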
Affiliation(s)
- Inbar Cohen-Gihon
- Sackler Institute of Molecular Medicine, Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
| | - Jessica H. Fong
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Roded Sharan
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
| | - Ruth Nussinov
- Sackler Institute of Molecular Medicine, Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Center for Cancer Research Nanobiology Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, USA
| | - Teresa M. Przytycka
- Center for Cancer Research Nanobiology Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, USA
| | - Anna R. Panchenko
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
19
|
FACT: functional annotation transfer between proteins with similar feature architectures. BMC Bioinformatics 2010; 11:417. [PMID: 20696036 PMCID: PMC2931517 DOI: 10.1186/1471-2105-11-417] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 03/15/2010] [Accepted: 08/09/2010] [Indexed: 11/24/2022] Open
Abstract
Background The increasing number of sequenced genomes provides the basis for exploring the genetic and functional diversity within the tree of life. Only a tiny fraction of the encoded proteins undergoes a thorough experimental characterization. For the remainder, bioinformatics annotation tools are the only means to infer their function. Exploiting significant sequence similarities to already characterized proteins, commonly taken as evidence for homology, is the prevalent method to deduce functional equivalence. Such methods fail when homologs are too diverged, or when they have assumed a different function. Finally, due to convergent evolution, functional equivalence is not necessarily linked to common ancestry. Therefore complementary approaches are required to identify functional equivalents. Results We present the Feature Architecture Comparison Tool (FACT, http://www.cibiv.at/FACT) to search for functionally equivalent proteins. FACT uses the similarity between feature architectures of two proteins, i.e., the arrangements of functional domains, secondary structure elements and compositional properties, as a proxy for their functional equivalence. A scoring function measures feature architecture similarities, which enables searching for functional equivalents in entire proteomes. Our evaluation of 9,570 EC classified enzymes revealed that FACT, using the full feature set, outperformed the existing architecture-based approaches by identifying significantly more functional equivalents as highest scoring proteins. We show that FACT can identify functional equivalents that share no significant sequence similarity. However, when the highest scoring protein of FACT is also the protein with the highest local sequence similarity, it is in 99% of the cases functionally equivalent to the query. We demonstrate the versatility of FACT by identifying a missing link in the yeast glutathione metabolism and also by searching for the human GolgA5 equivalent in Trypanosoma brucei.
Conclusions FACT facilitates a quick and sensitive search for functionally equivalent proteins in entire proteomes. FACT is complementary to approaches using sequence similarity to identify proteins with the same function. Thus, FACT is particularly useful when functional equivalents need to be identified in evolutionarily distant species, or when functional equivalents are not homologous. The most reliable annotation transfers, however, are achieved when feature architecture similarity and sequence similarity are jointly taken into account.
|
20
|
Abstract
The 2009 annual conference of the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation, founded in 1998, was organized as the 8th International Conference on Bioinformatics (InCoB), Sept. 9-11, 2009 at Biopolis, Singapore. InCoB has actively engaged researchers from the life sciences and systems biology, as well as clinicians, to facilitate greater synergy between these groups. To encourage bioinformatics students and new researchers, tutorials and a student symposium, the Singapore Symposium on Computational Biology (SYMBIO), were organized, along with the Workshop on Education in Bioinformatics and Computational Biology (WEBCB) and the Clinical Bioinformatics (CBAS) Symposium. However, to many students and young researchers, pursuing a career in a multi-disciplinary area such as bioinformatics poses a Himalayan challenge. A collection of tips is presented here to provide signposts on the road to a career in bioinformatics. An overview of the application of bioinformatics to traditional and emerging areas, published in this supplement, is also presented to provide possible future avenues of bioinformatics investigation. A case study on the application of e-learning tools in an undergraduate bioinformatics curriculum provides information on how to impart targeted education, to sustain bioinformatics in the Asia-Pacific region. The next InCoB is scheduled to be held in Tokyo, Japan, Sept. 26-28, 2010.
|