1. Chen Y, Huang J, Qin H, Zhang K, Fu Y, Li J, Wang R, Chen K, Xiong J, Miao W, Wang G, Zhang L. Chromosome-level genome assembly of Cryptosporidium parvum by long-read sequencing of ten oocysts. Sci Data 2024; 11:1287. [PMID: 39592642; PMCID: PMC11599830; DOI: 10.1038/s41597-024-04150-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 11/19/2024] [Indexed: 11/28/2024] [Open access]
Abstract
Cryptosporidium parvum is a zoonotic parasite of the intestine and poses a threat to human and animal health. However, it is difficult to obtain a large number of oocysts for genome sequencing using in vitro culture. To address this challenge, we employed whole-genome amplification of 10 oocysts followed by long-read sequencing and obtained a high-quality genome assembly of the C. parvum IIdA19G1 subtype isolated from a pre-weaning calf with diarrhea. The assembled genome was 9.13 Mb long and encompassed eight chromosomes, six of which were capped by telomeric sequences at one or both ends. In total, 3,915 protein-coding genes were predicted, showing high completeness with 98.2% single-copy BUSCO genes. To our knowledge, this represents the first chromosome-level genome assembly of C. parvum achieved through the combined use of whole-genome amplification of 10 oocysts and long-read sequencing. This achievement not only advances our understanding of the genomic landscape of this zoonotic intestinal parasite, but also provides valuable resources for comparative genomics and evolutionary analyses within the Cryptosporidium clade.
Affiliation(s)
- Yuancai Chen
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Jianying Huang
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Huikai Qin
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Kaihui Zhang
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Yin Fu
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Junqiang Li
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Rongjun Wang
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
- Kai Chen
  - Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Jie Xiong
  - Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - Key Laboratory of Breeding Biotechnology and Sustainable Aquaculture, Chinese Academy of Sciences, Wuhan, 430072, China
- Wei Miao
  - Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - Key laboratory of Lake and Watershed Science for Water Security, Chinese Academy of Sciences, Nanjing, 210008, China
- Guangying Wang
  - Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Longxian Zhang
  - College of Veterinary Medicine, Henan Agricultural University, Zhengzhou, 450046, P. R. China
  - International Joint Research Laboratory for Zoonotic Diseases of Henan, Zhengzhou, 450046, P. R. China
  - Key Laboratory of Quality and Safety Control of Poultry Products (Zhengzhou), Ministry of Agriculture and Rural Affairs, Zhengzhou, 450046, China
2. Yang Q, Yang L, Wang Y, Chen Y, Hu K, Yang W, Zuo S, Xu J, Kang Z, Xiao X, Li G. A High-Quality Genome of Rhizoctonia solani, a Devastating Fungal Pathogen with a Wide Host Range. Mol Plant Microbe Interact 2022; 35:954-958. [PMID: 36173255; DOI: 10.1094/mpmi-06-22-0126-a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Affiliation(s)
- Qun Yang
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Lei Yang
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Yin Wang
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Yilyu Chen
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Keming Hu
  - Key Laboratory of Plant Functional Genomics of The Ministry of Education, Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Agricultural College of Yangzhou University, Yangzhou 25009, China
- Wei Yang
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Shimin Zuo
  - Key Laboratory of Plant Functional Genomics of The Ministry of Education, Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Agricultural College of Yangzhou University, Yangzhou 25009, China
- Jiandi Xu
  - Institute of Wetland Agriculture and Ecology, Shandong Academy of Agricultural Sciences, Jinan 250100, China
- Zhensheng Kang
  - State Key Laboratory of Crop Stress Biology for Arid Areas and College of Plant Protection, Northwest A&F University, Yangling 712100, Shaanxi, China
- Xueqiong Xiao
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
- Guotian Li
  - State Key Laboratory of Agricultural Microbiology, Hubei Hongshan Laboratory, the Provincial Key Laboratory of Plant Pathology of Hubei Province, College of Plant Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
3. Masłowska-Górnicz A, van den Bosch MRM, Saccenti E, Suarez-Diez M. A large-scale analysis of codon usage bias in 4868 bacterial genomes shows association of codon adaptation index with GC content, protein functional domains and bacterial phenotypes. Biochim Biophys Acta Gene Regul Mech 2022; 1865:194826. [PMID: 35605953; DOI: 10.1016/j.bbagrm.2022.194826] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/10/2022] [Revised: 05/05/2022] [Accepted: 05/12/2022] [Indexed: 06/15/2023]
Abstract
Multiple synonymous codons code for the same amino acid, resulting in the degeneracy of the genetic code and in the preferred use of some codons, called codon bias usage (CBU). We performed a large-scale analysis of codon usage bias, analysing the distribution of the codon adaptation index (CAI) and the codon relative adaptiveness index (RA) in 4868 bacterial genomes. We found that CAI values differ significantly between protein functional domains and the parts of the protein outside domains, and we show how CAI, GC content and preferred usage of polymerase III alpha subunits are related. Additionally, we give evidence of the association between CAI and bacterial phenotypes.
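For readers unfamiliar with the two indices named in this abstract, the following is a minimal sketch of the classic definition: the relative adaptiveness of a codon is its count divided by the count of the most frequent synonymous codon in a reference gene set, and the CAI of a gene is the geometric mean of those values over its codons. The codon table below is a deliberately tiny illustrative subset, the function names are hypothetical, and the 0.5 pseudocount for unseen codons is an assumed convention; none of this is taken from the cited paper's pipeline.

```python
import math
from collections import Counter

# Illustrative subset of synonymous codon families (assumption: a real
# analysis would use the full genetic code, excluding Met, Trp and stops).
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}


def codons(seq):
    """Split a coding sequence into complete codons, dropping any tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_seqs):
    """Return w[codon] = count / max count within its synonymous family."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    w = {}
    for family in SYNONYMOUS.values():
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Assumed convention: unseen codons get a pseudocount of 0.5.
            w[c] = (counts[c] or 0.5) / top
    return w


def cai(seq, w):
    """Geometric mean of relative adaptiveness over the scored codons."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


reference = ["AAAAAAAAGGAAGAGGAG"]  # toy reference set
weights = relative_adaptiveness(reference)
print(cai("AAAGAG", weights))  # uses only the preferred codons
print(cai("AAGGAA", weights))  # uses only the less frequent codons
```

A gene built only from the preferred codons of the reference set scores 1.0, and the score falls as rarer synonymous codons are used.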
Affiliation(s)
- Anna Masłowska-Górnicz
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, the Netherlands
- Melanie R M van den Bosch
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, the Netherlands
- Edoardo Saccenti
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, the Netherlands
- Maria Suarez-Diez
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, the Netherlands
4. Mulnaes D, Golchin P, Koenig F, Gohlke H. TopDomain: Exhaustive Protein Domain Boundary Metaprediction Combining Multisource Information and Deep Learning. J Chem Theory Comput 2021; 17:4599-4613. [PMID: 34161735; DOI: 10.1021/acs.jctc.1c00129] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/14/2022]
Abstract
Protein domains are independent, functional, and stable structural units of proteins. Accurate protein domain boundary prediction plays an important role in understanding protein structure and evolution, as well as for protein structure prediction. Current domain boundary prediction methods differ in terms of boundary definition, methodology, and training databases resulting in disparate performance for different proteins. We developed TopDomain, an exhaustive metapredictor, that uses deep neural networks to combine multisource information from sequence- and homology-based features of over 50 primary predictors. For this purpose, we developed a new domain boundary data set termed the TopDomain data set, in which the true annotations are informed by SCOPe annotations, structural domain parsers, human inspection, and deep learning. We benchmark TopDomain against 2484 targets with 3354 boundaries from the TopDomain test set and achieve F1 scores of 78.4% and 73.8% for multidomain boundary prediction within ±20 residues and ±10 residues of the true boundary, respectively. When examined on targets from CASP11-13 competitions, TopDomain achieves F1 scores of 47.5% and 42.8% for multidomain proteins. TopDomain significantly outperforms 15 widely used, state-of-the-art ab initio and homology-based domain boundary predictors. Finally, we implemented TopDomainTMC, which accurately predicts whether domain parsing is necessary for the target protein.
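The F1 scores quoted above depend on the residue tolerance, which a small sketch can make concrete. The function below counts a predicted boundary as a true positive when it falls within `tol` residues of a not-yet-matched true boundary, then combines precision and recall into F1. The greedy one-to-one matching and the name `boundary_f1` are assumptions made for illustration; the benchmark's exact matching protocol may differ.

```python
def boundary_f1(true_boundaries, predicted_boundaries, tol):
    """F1 for boundary prediction with a +/- tol residue tolerance.

    Assumption: each true boundary can be matched by at most one
    prediction, using a simple greedy pass over sorted positions.
    """
    unmatched = sorted(true_boundaries)
    true_positives = 0
    for p in sorted(predicted_boundaries):
        hit = next((t for t in unmatched if abs(t - p) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            true_positives += 1

    precision = (true_positives / len(predicted_boundaries)
                 if predicted_boundaries else 0.0)
    recall = (true_positives / len(true_boundaries)
              if true_boundaries else 0.0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up residue positions.
truth = [100, 250]
predicted = [95, 180, 262]
print(boundary_f1(truth, predicted, tol=20))
print(boundary_f1(truth, predicted, tol=10))
```

With these toy positions the tighter tolerance rejects the prediction at residue 262, which is why the same predictor reports a lower score at ±10 than at ±20 residues.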
Affiliation(s)
- Daniel Mulnaes
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Pegah Golchin
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Filip Koenig
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Holger Gohlke
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
  - John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Biochemistry) & Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
5. Hu XJ, Li T, Wang Y, Xiong Y, Wu XH, Zhang DL, Ye ZQ, Wu YD. Prokaryotic and Highly-Repetitive WD40 Proteins: A Systematic Study. Sci Rep 2017; 7:10585. [PMID: 28878378; PMCID: PMC5587647; DOI: 10.1038/s41598-017-11115-1] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 02/13/2017] [Accepted: 08/18/2017] [Indexed: 12/22/2022] [Open access]
Abstract
As an ancient protein family, the WD40 repeat proteins often play essential roles in fundamental cellular processes in eukaryotes. Although investigations of eukaryotic WD40 proteins have been frequently reported, prokaryotic ones remain largely uncharacterized. In this paper, we report a systematic analysis of prokaryotic WD40 proteins and detailed comparisons with eukaryotic ones. About 4,000 prokaryotic WD40 proteins have been identified, accounting for 6.5% of all WD40s. While their abundances are less than 0.1% in most prokaryotes, they are enriched in certain species from Cyanobacteria and Planctomycetes, and participate in various functions such as prokaryotic signal transduction and nutrient synthesis. Comparisons show that a higher proportion of prokaryotic WD40s tend to contain multiple WD40 domains and a large number of hydrogen bond networks. The observation that prokaryotic WD40 proteins tend to show high internal sequence identity suggests that a substantial proportion of them (~20%) should be formed by recent or young repeat duplication events. Further studies demonstrate that the very young WD40 proteins, i.e., Highly-Repetitive WD40s, should be of higher stability. Our results have presented a catalogue of prokaryotic WD40 proteins, and have shed light on their evolutionary origins.
Affiliation(s)
- Xue-Jia Hu
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Tuan Li
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Yang Wang
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Yao Xiong
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Xian-Hui Wu
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- De-Lin Zhang
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Zhi-Qiang Ye
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
- Yun-Dong Wu
  - Lab of Computational Chemistry and Drug Design, Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China
  - College of Chemistry, Peking University, Beijing, 100871, P.R. China
6. Yao S, Flight RM, Rouchka EC, Moseley HNB. Aberrant coordination geometries discovered in the most abundant metalloproteins. Proteins 2017; 85:885-907. [PMID: 28142195; PMCID: PMC5389913; DOI: 10.1002/prot.25257] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 07/15/2016] [Revised: 01/16/2017] [Accepted: 01/18/2017] [Indexed: 11/09/2022]
Abstract
Metalloproteins bind and utilize metal ions for a variety of biological purposes. Due to the ubiquity of metalloprotein involvement throughout these processes across all domains of life, how proteins coordinate metal ions for different biochemical functions is of great relevance to understanding the implementation of these biological processes. Toward these ends, we have improved our methodology for structurally and functionally characterizing metal binding sites in metalloproteins. Our new ligand detection method is statistically much more robust, producing estimated false positive and false negative rates of ∼0.11% and ∼1.2%, respectively. Additional improvements expand both the range of metal ions and their coordination number that can be effectively analyzed. The inclusion of additional quality control filters has also significantly improved structure-function Spearman correlations, as demonstrated by rho values greater than 0.90 for several metal coordination analyses and even one rho value above 0.95. In addition, improvements in bond-length distributions have revealed bond-length modes specific to chemical functional groups involved in multidentation. Using these improved methods, we analyzed all single metal ion binding sites with Zn, Mg, Ca, Fe, and Na ions in the wwPDB, producing statistically rigorous results supporting the existence of both a significant number of unexpected compressed angles and subsequent aberrant metal ion coordination geometries (CGs) within structurally known metalloproteins. By recognizing these aberrant CGs in our clustering analyses, high correlations are achieved between structural and functional descriptions of metal ion coordination. Moreover, distinct biochemical functions are associated with aberrant CGs versus nonaberrant CGs.
Affiliation(s)
- Sen Yao
  - School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, Kentucky, 40292
  - Department of Computer Engineering and Computer Science, University of Louisville, Louisville, Kentucky, 40292
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, 40356
  - Center for Environmental and Systems Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
- Robert M Flight
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, 40356
  - Center for Environmental and Systems Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
- Eric C Rouchka
  - School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, Kentucky, 40292
  - Department of Computer Engineering and Computer Science, University of Louisville, Louisville, Kentucky, 40292
- Hunter N B Moseley
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, 40356
  - Center for Environmental and Systems Biochemistry, University of Kentucky, Lexington, Kentucky, 40356
7. Kludas J, Arvas M, Castillo S, Pakula T, Oja M, Brouard C, Jäntti J, Penttilä M, Rousu J. Machine Learning of Protein Interactions in Fungal Secretory Pathways. PLoS One 2016; 11:e0159302. [PMID: 27441920; PMCID: PMC4956264; DOI: 10.1371/journal.pone.0159302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 02/15/2016] [Accepted: 06/30/2016] [Indexed: 12/18/2022] [Open access]
Abstract
In this paper we apply machine learning methods for predicting protein interactions in fungal secretion pathways. We assume an inter-species transfer setting, where training data is obtained from a single species and the objective is to predict protein interactions in other, related species. In our methodology, we combine several state-of-the-art machine learning approaches, namely multiple kernel learning (MKL), pairwise kernels and kernelized structured output prediction in the supervised graph inference framework. For MKL, we apply recently proposed centered kernel alignment and p-norm path following approaches to integrate several feature sets describing the proteins, demonstrating improved performance. For graph inference, we apply input-output kernel regression (IOKR) in supervised and semi-supervised modes as well as output kernel trees (OK3). In our experiments simulating increasing genetic distance, IOKR proved to be the most robust prediction approach. We also show that the MKL approaches improve the predictions compared to a uniform combination of the kernels. We evaluate the methods on the task of predicting protein-protein interactions in the secretion pathways of fungi, with S. cerevisiae (baker's yeast) as the source and T. reesei as the target of the inter-species transfer learning. We identify completely novel candidate secretion proteins conserved in filamentous fungi. These proteins could contribute to their unique secretion capabilities.
Affiliation(s)
- Jana Kludas
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
- Mikko Arvas
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Tiina Pakula
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Merja Oja
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Céline Brouard
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
- Jussi Jäntti
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Merja Penttilä
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Juho Rousu
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
8. An assessment of the amount of untapped fold level novelty in under-sampled areas of the tree of life. Sci Rep 2015; 5:14717. [PMID: 26434770; PMCID: PMC4592975; DOI: 10.1038/srep14717] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/27/2015] [Accepted: 09/07/2015] [Indexed: 11/14/2022] [Open access]
Abstract
Previous studies of protein fold space suggest that fold coverage is plateauing. However, sequence sampling has been, and to a large extent remains, heavily biased, focusing on culturable phyla. Sustained technological developments have fuelled the advent of metagenomics and single-cell sequencing, which might correct the current sequencing bias. The extent to which these efforts affect structural diversity remains unclear, although preliminary results suggest that uncultured organisms could constitute a source of new folds. We investigate to what extent genomes from uncultured and under-sampled phyla accessed through single-cell sequencing, metagenomics and high-throughput culturing efforts have the potential to increase protein fold space, and conclude that i) genomes from under-sampled phyla appear enriched in sequences not covered by current protein family and fold profile libraries, ii) this enrichment is linked to an excess of short (and possibly partly spurious) sequences in some of the datasets, and iii) the discovery rate of novel folds among sequences uncovered by current fold and family profile libraries may be as high as 36%, but would ultimately translate into a marginal increase in global discovery of novel folds. Thus, genomes from under-sampled phyla should have a rather limited impact on increasing coarse-grained tertiary structure level novelty.
9. Deng L, Chen Z. An Integrated Framework for Functional Annotation of Protein Structural Domains. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:902-13. [PMID: 26357331; DOI: 10.1109/tcbb.2015.2389213] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 05/21/2023]
Abstract
Structural domains are evolutionary and functional units of proteins and play a critical role in comparative and functional genomics. Computational assignment of domain function with high reliability is essential for understanding whole-protein functions. However, functional annotations are conventionally assigned onto full-length proteins rather than associating specific functions to the individual structural domains. In this article, we present Structural Domain Annotation (SDA), a novel computational approach to predict functions for SCOP structural domains. The SDA method integrates heterogeneous information sources, including structure alignment based protein-SCOP mapping features, InterPro2GO mapping information, PSSM Profiles, and sequence neighborhood features, with a Bayesian network. By large-scale annotating Gene Ontology terms to SCOP domains with SDA, we obtained a database of SCOP domain to Gene Ontology mappings, which contains ~162,000 out of the approximately 166,900 domains in SCOPe 2.03 (>97 percent) and their predicted Gene Ontology functions. We have benchmarked SDA using a single-domain protein dataset and an independent dataset from different species. Comparative studies show that SDA significantly outperforms the existing function prediction methods for structural domains in terms of coverage and maximum F-measure.
10. Roche DB, Brüls T. The enzymatic nature of an anonymous protein sequence cannot reliably be inferred from superfamily level structural information alone. Protein Sci 2015; 24:643-50. [PMID: 25559918; DOI: 10.1002/pro.2635] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/18/2014] [Revised: 12/13/2014] [Accepted: 12/29/2014] [Indexed: 11/12/2022]
Abstract
As the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for the annotation of increasingly higher volumes of sequence data, we wanted to assess the strength of the link between coarse grained structural data (i.e., homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we took advantage of 41 phylogenetically diverse (encompassing 11 distinct phyla) genomes recently sequenced within the GEBA initiative, for which we integrated structural information, as defined by CATH, with enzyme level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of domain sequences occurring in the analyzed genomes was found to be associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme versus non-enzyme biased structural classes or excluding highly prevalent folds from the analysis had only modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated to catalytic activities indicates that, on its own, the power of coarse grained structural information to infer the general property of being an enzyme is rather limited.
Affiliation(s)
- Daniel Barry Roche
  - Laboratoire de génomique et biochimie du métabolisme, Genoscope, Institut de Génomique, Commissariat à l'Energie Atomique et aux Energies Alternatives, Evry, Essonne, 91057, France
  - UMR 8030 - Génomique Métabolique, Centre National de la Recherche Scientifique, Evry, Essonne, 91057, France
  - Départment de Biologie, Université d'Evry-Val-d'Essonne, Evry, Essonne, 91000, France
  - PRES UniverSud Paris, Saint-Aubin, Essonne, 91190, France
11. Zhao H, Peng Z, Fei B, Li L, Hu T, Gao Z, Jiang Z. BambooGDB: a bamboo genome database with functional annotation and an analysis platform. Database (Oxford) 2014; 2014:bau006. [PMID: 24602877; PMCID: PMC3944406; DOI: 10.1093/database/bau006] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Indexed: 12/31/2022]
Abstract
Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. The recent success in producing the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies on the basis of previous achievements, we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, make up the main contents of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform composed of comparative genomic analysis, protein–protein interaction networks, pathway analysis and visualization of genomic data was constructed. As a set of discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. By integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide researchers worldwide with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org
Affiliation(s)
- Hansheng Zhao
  - State Forestry Administration Key Open Laboratory on the Science and Technology of Bamboo and Rattan, International Center for Bamboo and Rattan, Beijing 100102, China, State key laboratory of tree genetics and breeding, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China and Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
12
|
Zhao H, Peng Z, Fei B, Li L, Hu T, Gao Z, Jiang Z. BambooGDB: a bamboo genome database with functional annotation and an analysis platform. Database (Oxford) 2014. [PMID: 24602877 DOI: 10.1093/database/bau100636t36t36t] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/17/2023]
Abstract
Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein-protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.
Affiliation(s)
- Hansheng Zhao
- State Forestry Administration Key Open Laboratory on the Science and Technology of Bamboo and Rattan, International Center for Bamboo and Rattan, Beijing 100102, China, State key laboratory of tree genetics and breeding, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China and Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
|
13
|
Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, Lee D, Lees JG, Lewis TE, Studer RA, Rentzsch R, Yeats C, Thornton JM, Orengo CA. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 2012. [PMID: 23203873 PMCID: PMC3531114 DOI: 10.1093/nar/gks1211] [Citation(s) in RCA: 175] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.
Affiliation(s)
- Ian Sillitoe
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London WC1E 6BT, UK
|
14
|
Furnham N, Laskowski RA, Thornton JM. Abstracting knowledge from the protein data bank. Biopolymers 2012; 99:183-8. [DOI: 10.1002/bip.22107] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2012] [Accepted: 05/25/2012] [Indexed: 12/27/2022]
|
15
|
Ahmadi Adl A, Nowzari-Dalini A, Xue B, Uversky VN, Qian X. Accurate prediction of protein structural classes using functional domains and predicted secondary structure sequences. J Biomol Struct Dyn 2012; 29:623-33. [DOI: 10.1080/07391102.2011.672626] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
16
|
Uncovering the molecular machinery of the human spindle--an integration of wet and dry systems biology. PLoS One 2012; 7:e31813. [PMID: 22427808 PMCID: PMC3302876 DOI: 10.1371/journal.pone.0031813] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2011] [Accepted: 01/18/2012] [Indexed: 11/19/2022] Open
Abstract
The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation, it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins localize to the spindle apparatus, compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called "hidden spindle hubs", proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
|
17
|
Ezkurdia I, Tress ML. Protein structural domains: definition and prediction. CURRENT PROTOCOLS IN PROTEIN SCIENCE 2011; Chapter 2:2.14.1-2.14.16. [PMID: 22045561 DOI: 10.1002/0471140864.ps0214s66] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Recognition and prediction of structural domains in proteins is an important part of structure and function prediction. This unit lists the range of tools available for domain prediction, and describes sequence and structural analysis tools that complement domain prediction methods. Also detailed are the basic domain prediction steps, along with suggested strategies for different protein sequences and potential pitfalls in domain boundary prediction. The difficult problem of domain orientation prediction is also discussed. All the resources necessary for domain boundary prediction are accessible via publicly available Web servers and databases and do not require computational expertise.
Affiliation(s)
- Iakes Ezkurdia
- Spanish National Cancer Research Centre (CNIO)-Structural Biology and Biocomputing Programme, Madrid, Spain
- Michael L Tress
- Spanish National Cancer Research Centre (CNIO)-Structural Biology and Biocomputing Programme, Madrid, Spain
|
18
|
Backes C, Ludwig N, Leidinger P, Harz C, Hoffmann J, Keller A, Meese E, Lenhof HP. Immunogenicity of autoantigens. BMC Genomics 2011; 12:340. [PMID: 21726451 PMCID: PMC3149588 DOI: 10.1186/1471-2164-12-340] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2011] [Accepted: 07/04/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Autoantibodies against self-antigens have been associated not only with autoimmune diseases, but also with cancer and are even found in healthy individuals. The mechanism causing the autoantibody response remains elusive for the majority of the immunogenic antigens. To deepen the understanding of autoantibody responses, we ask whether naturally occurring, autoimmunity-associated and tumor-associated antigens have structural or biological features related to the immune response. To this end, we have carried out the most comprehensive in silico study of different groups of autoantigens, including large antigen sets identified by our groups combined with publicly available antigen sets. Results: We found evidence for an enrichment of genes with a larger exon length, which increases the probability of the occurrence of potential immunogenic features such as mutations, SNPs, immunogenic sequence patterns and structural epitopes, or alternative splicing events. While SNPs seem to play a more central role in autoimmunity, somatic mutations seem to be more strongly enriched in tumor-associated antigens. In addition, antigens of autoimmune diseases differ from other antigen sets in that they appear preferentially secreted, frequently have an extracellular location, and are enriched in pathways associated with the immune system. Furthermore, for autoantibodies in general, we found enrichment of sequence-based properties including coiled-coil motifs, ELR motifs, and zinc finger DNA-binding motifs. Moreover, we found enrichment of proteins binding to proteins or nucleic acids including RNA and enrichment of proteins that are part of the ribosome or spliceosome. Both homologies to proteins of other species and an enrichment of ancient protein domains indicate that immunogenic proteins are evolutionarily conserved and that mimicry might play a central role.
Conclusions: Our results provide evidence that proteins which i) are evolutionarily conserved, ii) show specific sequence motifs, and iii) are part of cellular structures show an increased likelihood of becoming autoimmunogenic.
Affiliation(s)
- Christina Backes
- Center for Bioinformatics, Saarland University, 66041 Saarbrücken, Germany.
|
19
|
Gabanyi MJ, Adams PD, Arnold K, Bordoli L, Carter LG, Flippen-Andersen J, Gifford L, Haas J, Kouranov A, McLaughlin WA, Micallef DI, Minor W, Shah R, Schwede T, Tao YP, Westbrook JD, Zimmerman M, Berman HM. The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2011. [PMID: 21472436 DOI: 10.1007/s10969-011-9106-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 09/29/2022]
Abstract
The Protein Structure Initiative's Structural Biology Knowledgebase (SBKB, URL: http://sbkb.org ) is an open web resource designed to turn the products of the structural genomics and structural biology efforts into knowledge that can be used by the biological community to understand living systems and disease. Here we will present examples on how to use the SBKB to enable biological research. For example, a protein sequence or Protein Data Bank (PDB) structure ID search will provide a list of related protein structures in the PDB, associated biological descriptions (annotations), homology models, structural genomics protein target status, experimental protocols, and the ability to order available DNA clones from the PSI:Biology-Materials Repository. A text search will find publication and technology reports resulting from the PSI's high-throughput research efforts. Web tools that aid in research, including a system that accepts protein structure requests from the community, will also be described. Created in collaboration with the Nature Publishing Group, the Structural Biology Knowledgebase monthly update also provides a research library, editorials about new research advances, news, and an events calendar to present a broader view of structural genomics and structural biology.
Affiliation(s)
- Margaret J Gabanyi
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
|
20
|
The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. Journal of Structural and Functional Genomics 2011; 12:45-54. [PMID: 21472436 PMCID: PMC3123456 DOI: 10.1007/s10969-011-9106-2] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2010] [Accepted: 03/21/2011] [Indexed: 01/10/2023]
|
21
|
Dessailly BH, Redfern OC, Cuff AL, Orengo CA. Detailed analysis of function divergence in a large and diverse domain superfamily: toward a refined protocol of function classification. Structure 2011; 18:1522-35. [PMID: 21070951 DOI: 10.1016/j.str.2010.08.017] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2010] [Revised: 08/06/2010] [Accepted: 08/13/2010] [Indexed: 10/18/2022]
Abstract
Some superfamilies contain large numbers of protein domains with very different functions. The ability to refine the functional classification of domains within these superfamilies is necessary for better understanding the evolution of functions and to guide function prediction of new relatives. To achieve this, a suitable starting point is the detailed analysis of functional divisions and mechanisms of functional divergence in a single superfamily. Here, we present such a detailed analysis in the superfamily of HUP domains. A biologically meaningful functional classification of HUP domains is obtained manually. Mechanisms of function diversification are investigated in detail using this classification. We observe that structural motifs play an important role in shaping broad functional divergence, whereas residue-level changes shape diversity at a more specific level. In parallel we examine the ability of an automated protocol to capture the biologically meaningful classification, with a view to automatically extending this classification in the future.
Affiliation(s)
- Benoit H Dessailly
- Department of Structural and Molecular Biology, University College of London, Gower Street, London WC1E6BT, UK.
|
22
|
Abstract
Improvements in nucleotide sequencing technology have resulted in an ever increasing number of nucleotide and protein sequences being deposited in databases. Unfortunately, the ability to manually classify and annotate these sequences cannot keep pace with their rapid generation, resulting in an increased bias toward unannotated sequence. Automatic annotation tools can help redress the balance. There are a number of different groups working to produce protein signatures that describe protein families, functional domains or conserved sites within related groups of proteins. Protein signature databases include CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs. Their approaches range from characterising small conserved motifs that can identify members of a family or subfamily, to the use of hidden Markov models that describe the conservation of residues over entire domains or whole proteins. To increase their value as protein classification tools, protein signatures from these 11 databases have been combined into one, powerful annotation tool: the InterPro database (http://www.ebi.ac.uk/interpro/) (Hunter et al., Nucleic Acids Res 37:D211-D215, 2009). InterPro is an open-source protein resource used for the automatic annotation of proteins, and is scalable to the analysis of entire new genomes through the use of a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro provides structural information from PDB (Kouranov et al., Nucleic Acids Res 34:D302-D305, 2006), its classification in CATH (Cuff et al., Nucleic Acids Res 37:D310-D314, 2009) and SCOP (Andreeva et al., Nucleic Acids Res 36:D419-D425, 2008), as well as homology models from ModBase (Pieper et al., Nucleic Acids Res 37:D347-D354, 2009) and SwissModel (Kiefer et al., Nucleic Acids Res 37:D387-D392, 2009), allowing a direct comparison of the protein signatures with the available structural information. 
This chapter reviews the signature methods found in the InterPro database, and provides an overview of the InterPro resource itself.
|
23
|
Abstract
In the past decades, a variety of publicly available data repositories and resources have been developed to support protein related information management, data-driven hypothesis generation and biological knowledge discovery. However, there is also an increasing confusion for the researchers who are trying to quickly find the appropriate resources to help them solve their problems. In this chapter, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. We conclude the chapter by discussing the challenges and opportunities for developing new protein bioinformatics databases.
|
24
|
Cuff AL, Sillitoe I, Lewis T, Clegg AB, Rentzsch R, Furnham N, Pellegrini-Calace M, Jones D, Thornton J, Orengo CA. Extending CATH: increasing coverage of the protein structure universe and linking structure with function. Nucleic Acids Res 2010; 39:D420-6. [PMID: 21097779 PMCID: PMC3013636 DOI: 10.1093/nar/gkq1001] [Citation(s) in RCA: 118] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
CATH version 3.3 (class, architecture, topology, homology) contains 128 688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.
Affiliation(s)
- Alison L Cuff
- Institute of Structural and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK.
|
25
|
Morilla I, Lees JG, Reid AJ, Orengo C, Ranea JAG. Assessment of protein domain fusions in human protein interaction networks prediction: application to the human kinetochore model. N Biotechnol 2010; 27:755-65. [PMID: 20851221 DOI: 10.1016/j.nbt.2010.09.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2010] [Revised: 07/29/2010] [Accepted: 09/11/2010] [Indexed: 01/13/2023]
Abstract
In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Some proteins, involved in a common biological process and encoded by separate genes in one organism, can be found fused within a single protein chain in other organisms. By detecting these triplets, a functional relationship can be established between the unfused proteins. Here we use a domain fusion prediction method to predict these protein interactions for the human interactome. We observed that gene fusion events are more related to physical interaction between proteins than to other weaker functional relationships such as participation in a common biological pathway. These results suggest that domain fusion is an appropriate method for predicting protein complexes. The most reliable fused domain predictions were used to build protein-protein interaction (PPI) networks. These predicted PPI network models showed the same topological features as real biological networks and different features from random behaviour. We built the PPI domain fusion sub-network model of the human kinetochore and observed that the majority of the predicted interactions have not yet been experimentally characterised in the publicly available PPI repositories. The study of the human kinetochore domain fusion sub-network reveals undiscovered kinetochore proteins with presumably relevant functions, such as hubs with many connections in the kinetochore sub-network. These results suggest that experimentally hidden regions in the predicted PPI networks contain key functional elements, associated with important functional areas, still undiscovered in the human interactome. Until novel experiments shed light on these hidden regions, domain fusion predictions provide a valuable approach for exploring them.
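The triplet idea described in this abstract (two separate proteins in one organism whose domains appear together in a single chain in another organism) can be illustrated with a minimal Python sketch. This is not the authors' prediction method: the protein names, domain family labels and the `fusion_triplets` helper are all hypothetical, and the sketch applies no reliability scoring.

```python
from itertools import combinations

# Hypothetical domain annotations: protein id -> set of domain families.
query_genome = {"P1": {"kinase"}, "P2": {"SH2"}, "P3": {"zinc_finger"}}
other_genome = {"F1": {"kinase", "SH2"}, "F2": {"zinc_finger"}}

def fusion_triplets(query, other):
    """Return (protein_a, protein_b, fused_protein) triplets.

    A triplet is reported when the domains of two distinct query proteins
    are all found together in one protein of the other genome.
    """
    triplets = []
    for (a, doms_a), (b, doms_b) in combinations(sorted(query.items()), 2):
        if doms_a & doms_b:
            # Pairs sharing a family are skipped: the evidence is ambiguous.
            continue
        for fused, doms_f in sorted(other.items()):
            if doms_a <= doms_f and doms_b <= doms_f:
                triplets.append((a, b, fused))
    return triplets

triplets = fusion_triplets(query_genome, other_genome)
print(triplets)
```

In this toy input only P1 and P2 are linked, through the fused protein F1. A realistic pipeline would also need to handle large homologous domain families, which this sketch ignores.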
Affiliation(s)
- Ian Morilla
- Department of Molecular Biology and Biochemistry, University of Malaga, Malaga, Spain.
|
26
|
Reid AJ, Ranea JAG, Clegg AB, Orengo CA. CODA: accurate detection of functional associations between proteins in eukaryotic genomes using domain fusion. PLoS One 2010; 5:e10908. [PMID: 20532224 PMCID: PMC2879367 DOI: 10.1371/journal.pone.0010908] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2009] [Accepted: 05/10/2010] [Indexed: 11/25/2022] Open
Abstract
Background: In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is however known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom. Methodology/Principal Findings: We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human in comparison with state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication. Conclusions/Significance: The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/.
Affiliation(s)
- Adam J Reid
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom.
|
27
|
Buchan DWA, Ward SM, Lobley AE, Nugent TCO, Bryson K, Jones DT. Protein annotation and modelling servers at University College London. Nucleic Acids Res 2010; 38:W563-8. [PMID: 20507913 PMCID: PMC2896093 DOI: 10.1093/nar/gkq427] [Citation(s) in RCA: 283] [Impact Index Per Article: 18.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
The UCL Bioinformatics Group web portal offers several high quality protein structure prediction and function annotation algorithms including PSIPRED, pGenTHREADER, pDomTHREADER, MEMSAT, MetSite, DISOPRED2, DomPred and FFPred for the prediction of secondary structure, protein fold, protein structural domain, transmembrane helix topology, metal binding sites, regions of protein disorder, protein domain boundaries and protein function, respectively. We also now offer a fully automated 3D modelling pipeline: BioSerf, which performed well in CASP8 and uses a fragment-assembly approach which placed it in the top five servers in the de novo modelling category. The servers are available via the group web site at http://bioinf.cs.ucl.ac.uk/.
Affiliation(s)
- D W A Buchan
- Bioinformatics Group, University College London, Gower Street, London, WC1E 6BT, UK.
|
28
|
Hinz U. From protein sequences to 3D-structures and beyond: the example of the UniProt knowledgebase. Cell Mol Life Sci 2010; 67:1049-64. [PMID: 20043185 PMCID: PMC2835715 DOI: 10.1007/s00018-009-0229-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2009] [Revised: 12/01/2009] [Accepted: 12/07/2009] [Indexed: 11/12/2022]
Abstract
With the dramatic increase in the volume of experimental results in every domain of life sciences, assembling pertinent data and combining information from different fields has become a challenge. Information is dispersed over numerous specialized databases and is presented in many different formats. Rapid access to experiment-based information about well-characterized proteins helps predict the function of uncharacterized proteins identified by large-scale sequencing. In this context, universal knowledgebases play essential roles in providing access to data from complementary types of experiments and serving as hubs with cross-references to many specialized databases. This review outlines how the value of experimental data is optimized by combining high-quality protein sequences with complementary experimental results, including information derived from protein 3D-structures, using as an example the UniProt knowledgebase (UniProtKB) and the tools and links provided on its website ( http://www.uniprot.org/ ). It also evokes precautions that are necessary for successful predictions and extrapolations.
Affiliation(s)
- Ursula Hinz
- Swiss-Prot Group, Swiss Institute of Bioinformatics, 1 rue Michel Servet, 1211, Geneva, Switzerland.
|
29
|
Malik R, Dulla K, Nigg EA, Körner R. From proteome lists to biological impact--tools and strategies for the analysis of large MS data sets. Proteomics 2010; 10:1270-1283. [PMID: 20077408 DOI: 10.1002/pmic.200900365] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2009] [Accepted: 11/16/2009] [Indexed: 01/03/2025]
Abstract
MS has become a method-of-choice for proteome analysis, generating large data sets, which reflect proteome-scale protein-protein interaction and PTM networks. However, while a rapid growth in large-scale proteomics data can be observed, the sound biological interpretation of these results clearly lags behind. Therefore, combined efforts of bioinformaticians and biologists have been made to develop strategies and applications to help experimentalists perform this crucial task. This review presents an overview of currently available analytical strategies and tools to extract biologically relevant information from large protein lists. Moreover, we also present current research publications making use of these tools as examples of how the presented strategies may be incorporated into proteomic workflows. Emphasis is placed on the analysis of Gene Ontology terms, interaction networks, biological pathways and PTMs. In addition, topics including domain analysis and text mining are reviewed in the context of computational analysis of proteomic results. We expect that these types of analyses will significantly contribute to a deeper understanding of the role of individual proteins, protein networks and pathways in complex systems.
Affiliation(s)
- Rainer Malik
- Max Planck Institute of Biochemistry, Department of Cell Biology, Martinsried, Germany
|
30
|
Reid AJ, Ranea JA, Orengo CA. Comparative evolutionary analysis of protein complexes in E. coli and yeast. BMC Genomics 2010; 11:79. [PMID: 20122144 PMCID: PMC2837643 DOI: 10.1186/1471-2164-11-79] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2009] [Accepted: 02/01/2010] [Indexed: 11/17/2022] Open
Abstract
Background: Proteins do not act in isolation; they frequently act together in protein complexes to carry out concerted cellular functions. The evolution of complexes is poorly understood, especially in organisms other than yeast, where little experimental data has been available. Results: We generated accurate, high coverage datasets of protein complexes for E. coli and yeast in order to study differences in the evolution of complexes between these two species. We show that substantial differences exist in how complexes have evolved between these organisms. A previously proposed model of complex evolution identified complexes with cores of interacting homologues. We support findings of the relative importance of this mode of evolution in yeast, but find that it is much less common in E. coli. Additionally, it is shown that those homologues which do cluster in complexes are involved in eukaryote-specific functions. Furthermore, we identify correlated pairs of non-homologous domains which occur in multiple protein complexes. These were identified in both yeast and E. coli and we present evidence that these too may represent complex cores in yeast but not those of E. coli. Conclusions: Our results suggest that there are differences in the way protein complexes have evolved in E. coli and yeast. Whereas some yeast complexes have evolved by recruiting paralogues, this is not apparent in E. coli. Furthermore, such complexes are involved in eukaryotic-specific functions. This implies that the increase in gene family sizes seen in eukaryotes in part reflects multiple family members being used within complexes. However, in general, in both E. coli and yeast, homologous domains are used in different complexes.
Affiliation(s)
- Adam J Reid
- Research Department of Structural & Molecular Biology, University College London, London, WC1E 6BT, UK.
|
31
|
Yeats C, Redfern OC, Orengo C. A fast and automated solution for accurately resolving protein domain architectures. Bioinformatics 2010; 26:745-51. [PMID: 20118117 DOI: 10.1093/bioinformatics/btq034] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Accurate prediction of the domain content and arrangement in multi-domain proteins (which make up >65% of the large-scale protein databases) provides a valuable tool for function prediction, comparative genomics and studies of molecular evolution. However, scanning a multi-domain protein against a database of domain sequence profiles can often produce conflicting and overlapping matches. We have developed a novel method that employs heaviest weighted clique-finding (HCF), which we show significantly outperforms standard published approaches based on successively assigning the best non-overlapping match (Best Match Cascade, BMC). RESULTS We created a benchmark data set of structural domain assignments in the CATH database and a corresponding set of Hidden Markov Model-based domain predictions. Using these, we demonstrate that by considering all possible combinations of matches using the HCF approach, we achieve much higher prediction accuracy than the standard BMC method. We also show that it is essential to allow overlapping domain matches to a query in order to identify correct domain assignments. Furthermore, we introduce a straightforward and effective protocol for resolving any overlapping assignments, and producing a single set of non-overlapping predicted domains. AVAILABILITY AND IMPLEMENTATION The new approach will be used to determine MDAs for UniProt and Ensembl, and made available via the Gene3D website: http://gene3d.biochem.ucl.ac.uk/Gene3D/. The software has been implemented in C++ and compiled for Linux: source code and binaries can be found at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/ CONTACT yeats@biochem.ucl.ac.uk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
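The contrast this abstract draws between the Best Match Cascade and heaviest weighted clique-finding can be illustrated with a small sketch. It is not the authors' C++ implementation. It assumes that any two matches sharing a residue are incompatible (the paper itself tolerates some overlap and resolves it afterwards), that all scores are positive, and that the match list is small enough for exhaustive search. The names `Match`, `compatible`, `best_match_cascade` and `heaviest_clique` are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """A hypothetical domain match on a query sequence (inclusive coordinates)."""
    name: str
    start: int
    end: int
    score: float


def compatible(a: Match, b: Match) -> bool:
    # Simplifying assumption: two matches are compatible only if they share no residue.
    return a.end < b.start or b.end < a.start


def best_match_cascade(matches: List[Match]) -> List[Match]:
    """Greedy baseline: repeatedly accept the best-scoring match that fits."""
    chosen: List[Match] = []
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if all(compatible(m, c) for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)


def heaviest_clique(matches: List[Match]) -> List[Match]:
    """Exhaustive search for the mutually compatible set with the highest total score."""
    best: List[Match] = []
    best_score = 0.0

    def extend(clique: List[Match], candidates: List[Match], score: float) -> None:
        nonlocal best, best_score
        if score > best_score:
            best, best_score = list(clique), score
        for i, m in enumerate(candidates):
            # Keep only later candidates that are also compatible with m.
            rest = [c for c in candidates[i + 1:] if compatible(m, c)]
            extend(clique + [m], rest, score + m.score)

    extend([], list(matches), 0.0)
    return sorted(best, key=lambda m: m.start)


# Invented example: one long, high-scoring match hides two shorter ones.
example = [
    Match("X", 1, 150, 60.0),
    Match("Y", 1, 100, 50.0),
    Match("Z", 101, 200, 50.0),
]
print([m.name for m in best_match_cascade(example)])  # greedy keeps only X
print([m.name for m in heaviest_clique(example)])     # clique search keeps Y and Z
```

In this invented example the greedy cascade commits to the single top-scoring match and cannot recover, whereas considering all compatible combinations finds the two-domain assignment with the higher total score.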
Affiliation(s)
- Corin Yeats
- Department of Structural and Molecular Biology, UCL, London WC1E 6BT, UK.
|
32
|
Lee DA, Rentzsch R, Orengo C. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Res 2009; 38:720-37. [PMID: 19923231 PMCID: PMC2817468 DOI: 10.1093/nar/gkp1049] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.3] [Indexed: 11/13/2022] Open
Abstract
GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.
Affiliation(s)
- David A Lee
- University College London - Structural and Molecular Biology, London, UK.
|
33
|
Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, Chivian D, Friedland GD, Huang KH, Keller K, Novichkov PS, Dubchak IL, Alm EJ, Arkin AP. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res 2009; 38:D396-400. [PMID: 19906701 PMCID: PMC2808868 DOI: 10.1093/nar/gkp919] [Citation(s) in RCA: 350] [Impact Index Per Article: 21.9] [Indexed: 11/12/2022] Open
Abstract
Since 2003, MicrobesOnline (http://www.microbesonline.org) has been providing a community resource for comparative and functional genome analysis. The portal includes over 1000 complete genomes of bacteria, archaea and fungi and thousands of expression microarrays from diverse organisms ranging from model organisms such as Escherichia coli and Saccharomyces cerevisiae to environmental microbes such as Desulfovibrio vulgaris and Shewanella oneidensis. To assist in annotating genes and in reconstructing their evolutionary history, MicrobesOnline includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree. To identify co-regulated genes, MicrobesOnline can search for genes based on their expression profile, and provides tools for identifying regulatory motifs and seeing if they are conserved. MicrobesOnline also includes fast phylogenetic profile searches, comparative views of metabolic pathways, operon predictions, a workbench for sequence analysis and integration with RegTransBase and other microbial genome resources. The next update of MicrobesOnline will contain significant new functionality, including comparative analysis of metagenomic sequence data. Programmatic access to the database, along with source code and documentation, is available at http://microbesonline.org/programmers.html.
Affiliation(s)
- Paramvir S Dehal
- Virtual Institute for Microbial Stress and Survival, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
|
34
|
The evolution of protein functions and networks: a family-centric approach. Biochem Soc Trans 2009; 37:745-50. [PMID: 19614587 DOI: 10.1042/bst0370745] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
The study of superfamilies of protein domains using a combination of structure, sequence and function data provides insights into deep evolutionary history. In the present paper, analyses of functional diversity within such superfamilies as defined in the CATH-Gene3D resource are described. These analyses focus on structure-function relationships in very large and diverse superfamilies, and on the evolution of domain superfamily members in protein-protein complexes.
|
35
|
Dessailly BH, Nair R, Jaroszewski L, Fajardo JE, Kouranov A, Lee D, Fiser A, Godzik A, Rost B, Orengo C. PSI-2: structural genomics to cover protein domain family space. Structure 2009; 17:869-81. [PMID: 19523904 DOI: 10.1016/j.str.2009.03.015] [Citation(s) in RCA: 106] [Impact Index Per Article: 6.6] [Received: 11/26/2008] [Revised: 03/18/2009] [Accepted: 03/22/2009] [Indexed: 11/25/2022]
Abstract
One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centers, targets representatives from large, structurally uncharacterized protein domain families, and from structurally uncharacterized subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly overrepresented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first 3 years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.
Affiliation(s)
- Benoît H Dessailly
- Department of Structural and Molecular Biology, University College of London, London WC1E6BT, UK.
|
36
|
Izarzugaza JMG, Baresic A, McMillan LEM, Yeats C, Clegg AB, Orengo CA, Martin ACR, Valencia A. An integrated approach to the interpretation of single amino acid polymorphisms within the framework of CATH and Gene3D. BMC Bioinformatics 2009; 10 Suppl 8:S5. [PMID: 19758469 PMCID: PMC2745587 DOI: 10.1186/1471-2105-10-s8-s5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND The phenotypic effects of sequence variations in protein-coding regions come about primarily via their effects on the resulting structures, for example by disrupting active sites or affecting structural stability. In order better to understand the mechanisms behind known mutant phenotypes, and predict the effects of novel variations, biologists need tools to gauge the impacts of DNA mutations in terms of their structural manifestation. Although many mutations occur within domains whose structure has been solved, many more occur within genes whose protein products have not been structurally characterized. RESULTS Here we present 3DSim (3D Structural Implication of Mutations), a database and web application facilitating the localization and visualization of single amino acid polymorphisms (SAAPs) mapped to protein structures even where the structure of the protein of interest is unknown. The server displays information on 6514 point mutations, 4865 of them known to be associated with disease. These polymorphisms are drawn from SAAPdb, which aggregates data from various sources including dbSNP and several pathogenic mutation databases. While the SAAPdb interface displays mutations on known structures, 3DSim projects mutations onto known sequence domains in Gene3D. This resource contains sequences annotated with domains predicted to belong to structural families in the CATH database. Mappings between domain sequences in Gene3D and known structures in CATH are obtained using a MUSCLE alignment. 1210 three-dimensional structures corresponding to CATH structural domains are currently included in 3DSim; these domains are distributed across 396 CATH superfamilies, and provide a comprehensive overview of the distribution of mutations in structural space. CONCLUSION The server is publicly available at http://3DSim.bioinfo.cnio.es/. 
In addition, the database containing the mapping between SAAPdb, Gene3D and CATH is available on request and most of the functionality is available through programmatic web service access.
Affiliation(s)
- Jose M G Izarzugaza
- Institute of Structural and Molecular Biology, University College London, UK.
|
37
|
Naumoff DG, Carreras M. PSI protein classifier: A new program automating PSI-BLAST search results. Mol Biol 2009. [DOI: 10.1134/s0026893309040189] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
|
38
|
Janky R, Helden JV, Babu MM. Investigating transcriptional regulation: From analysis of complex networks to discovery of cis-regulatory elements. Methods 2009; 48:277-86. [DOI: 10.1016/j.ymeth.2009.04.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/06/2009] [Revised: 04/17/2009] [Accepted: 04/18/2009] [Indexed: 10/20/2022] Open
|
39
|
Abstract
The large-scale structural biology projects that target human proteins focus predominantly on the catalytic domains of potential therapeutic targets and the domains of human proteins that mediate protein-protein and protein-small-molecule interactions. Their main scientific objective is to elucidate the molecular basis for specificity and selectivity of function within large protein families of therapeutic interest, such as kinases, phosphatases, and proteins involved in epigenetic regulation. Half of the unique human protein structures determined in the past three years derive from these initiatives.
Affiliation(s)
- Aled Edwards
- Banting and Best Department of Medical Research, University of Toronto, Ontario M5G 1L6, Canada
|
40
|
Chitale M, Hawkins T, Park C, Kihara D. ESG: extended similarity group method for automated protein function prediction. Bioinformatics 2009; 25:1739-45. [PMID: 19435743 DOI: 10.1093/bioinformatics/btp309] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION The importance of accurate automatic protein function prediction is ever increasing in the face of the large number of newly sequenced genomes and proteomics data sets awaiting biological interpretation. Conventional methods have focused on annotation transfer based on high sequence similarity, which relies on the concept of homology. However, many cases have been reported in which simple transfer of function from the top hits of a homology search causes erroneous annotation. New methods are required that handle sequence similarity more robustly, combining signals from strongly and weakly similar proteins to predict function for unknown proteins with high reliability. RESULTS We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned a probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We show how the statistical framework of ESG improves prediction accuracy by iteratively taking into account the neighborhood of the query protein in sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. The iterative search is effective in capturing multiple domains in a query protein, enabling accurate prediction of several functions that originate from different domains. AVAILABILITY The ESG web server is available for automated protein function prediction at http://dragon.bio.purdue.edu/ESG/.
Affiliation(s)
- Meghana Chitale
- Department of Computer Science, Purdue University, IN 47907, USA
|
41
|
Valas RE, Yang S, Bourne PE. Nothing about protein structure classification makes sense except in the light of evolution. Curr Opin Struct Biol 2009; 19:329-34. [PMID: 19394812 DOI: 10.1016/j.sbi.2009.03.011] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 12/16/2008] [Revised: 02/19/2009] [Accepted: 03/16/2009] [Indexed: 12/27/2022]
Abstract
In this, the 200th anniversary of Charles Darwin's birth and the 150th anniversary of the publication of the Origin of Species, it is fitting to revisit the classification of protein structures from an evolutionary perspective. Existing classifications use homologous sequence relationships, but knowing that structure is much more conserved than sequence creates an iterative loop from which structures can be further classified beyond that of the domain, thereby teasing out distant evolutionary relationships. The desired classification scheme is then one in which a fold is merely semantics and structure can be classified as either ancestral or derived.
Affiliation(s)
- Ruben E Valas
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093-0743, USA
|
42
|
Dessailly BH, Redfern OC, Cuff A, Orengo CA. Exploiting structural classifications for function prediction: towards a domain grammar for protein function. Curr Opin Struct Biol 2009; 19:349-56. [PMID: 19398323 DOI: 10.1016/j.sbi.2009.03.009] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 01/16/2009] [Revised: 02/17/2009] [Accepted: 03/16/2009] [Indexed: 12/28/2022]
Abstract
The ability to assign function to proteins has become a major bottleneck for comprehensively understanding cellular mechanisms at the molecular level. Here we discuss the extent to which structural domain classifications can help in deciphering the complex relationship between the functions of proteins and their sequences and structures. Structural classifications are particularly helpful in understanding the mosaic manner in which new proteins and functions emerge through evolution. This is partly because they provide reliable and concrete domain definitions and enable the detection of very remote structural similarities and homologies. It is also because structural data can illuminate more clearly the mechanisms by which a broader functional repertoire can emerge during evolution.
Affiliation(s)
- Benoît H Dessailly
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
|
43
|
Reeves GA, Talavera D, Thornton JM. Genome and proteome annotation: organization, interpretation and integration. J R Soc Interface 2009; 6:129-47. [PMID: 19019817 DOI: 10.1098/rsif.2008.0341] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Indexed: 11/12/2022] Open
Abstract
Recent years have seen a huge increase in the generation of genomic and proteomic data. This has been due to improvements in current biological methodologies, the development of new experimental techniques and the use of computers as support tools. All these raw data are useless if they cannot be properly analysed, annotated, stored and displayed. Consequently, a vast number of resources have been created to present the data to the wider community. Annotation tools and databases provide the means to disseminate these data and to comprehend their biological importance. This review examines the various aspects of annotation: type, methodology and availability. Moreover, it places special emphasis on novel annotation fields, such as that of phenotypes, and highlights recent efforts focused on integrating annotations.
Affiliation(s)
- Gabrielle A Reeves
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
44
|
Rentzsch R, Orengo CA. Protein function prediction--the power of multiplicity. Trends Biotechnol 2009; 27:210-9. [PMID: 19251332 DOI: 10.1016/j.tibtech.2009.01.002] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.5] [Received: 11/21/2008] [Revised: 01/21/2009] [Accepted: 01/23/2009] [Indexed: 01/07/2023]
Abstract
Advances in experimental and computational methods have quietly ushered in a new era in protein function annotation. This 'age of multiplicity' is marked by the notion that only the use of multiple tools, multiple evidence and considering the multiple aspects of function can give us the broad picture that 21st century biology will need to link and alter micro- and macroscopic phenotypes. It might also help us to undo past mistakes by removing errors from our databases and prevent us from producing more. On the downside, multiplicity is often confusing. We therefore systematically review methods and resources for automated protein function prediction, looking at individual (biochemical) and contextual (network) functions, respectively.
Affiliation(s)
- Robert Rentzsch
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
|
45
|
Talavera D, Laskowski RA, Thornton JM. WSsas: a web service for the annotation of functional residues through structural homologues. Bioinformatics 2009; 25:1192-4. [PMID: 19251774 DOI: 10.1093/bioinformatics/btp116] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION Annotation tools help scientists to traverse the gap between characterized and uncharacterized proteins. Tools for the prediction of protein function include those which predict the function of entire proteins or complexes, those annotating functional domains and those which predict specific residues within the domain. We have developed WSsas, a web service focused on the annotation of essential functional residues. WSsas uses similarity searches and pairwise alignments to transfer functional information about binding, catalytic and protein-protein interaction residues from solved structures to query sequences. In addition, WSsas can supply information about the relevant functional atoms. The web service definition (WSDL) file and a Perl client are freely available at http://www.ebi.ac.uk/thornton-srv/databases/WSsas/.
Affiliation(s)
- David Talavera
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
46
|
Abstract
Contemporary protein architectures can be regarded as molecular fossils, historical imprints that mark important milestones in the history of life. Whereas sequences change at a considerable pace, higher-order structures are constrained by the energetic landscape of protein folding, the exploration of sequence and structure space, and complex interactions mediated by the proteostasis and proteolytic machineries of the cell. The survey of architectures in the living world that was fuelled by recent structural genomic initiatives has been summarized in protein classification schemes, and the overall structure of fold space explored with novel bioinformatic approaches. However, metrics of general structural comparison have not yet unified architectural complexity using the 'shared and derived' tenet of evolutionary analysis. In contrast, a shift of focus from molecules to proteomes and a census of protein structure in fully sequenced genomes were able to uncover global evolutionary patterns in the structure of proteins. Timelines of discovery of architectures and functions unfolded episodes of specialization, reductive evolutionary tendencies of architectural repertoires in proteomes and the rise of modularity in the protein world. They revealed a biologically complex ancestral proteome and the early origin of the archaeal lineage. Studies also identified an origin of the protein world in enzymes of nucleotide metabolism harbouring the P-loop-containing triphosphate hydrolase fold and the explosive discovery of metabolic functions that recapitulated well-defined prebiotic shells and involved the recruitment of structures and functions. These observations have important implications for origins of modern biochemistry and diversification of life.
|
47
|
Nair R, Liu J, Soong TT, Acton TB, Everett JK, Kouranov A, Fiser A, Godzik A, Jaroszewski L, Orengo C, Montelione GT, Rost B. Structural genomics is the largest contributor of novel structural leverage. J Struct Funct Genomics 2009; 10:181-91. [PMID: 19194785 PMCID: PMC2705706 DOI: 10.1007/s10969-008-9055-6] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Received: 11/15/2008] [Accepted: 12/08/2008] [Indexed: 11/28/2022]
Abstract
The Protein Structural Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while the PSI-2 (2005-now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per structure contribution to novel leverage was over 4-fold higher than that for non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may just take another ~15 years to cover most sequences in the current UniProt database.
Affiliation(s)
- Rajesh Nair
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
|
48
|
Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol 2009; 10:207. [PMID: 19226439 PMCID: PMC2688287 DOI: 10.1186/gb-2009-10-2-207] [Citation(s) in RCA: 149] [Impact Index Per Article: 9.3] [Indexed: 11/10/2022] Open
Abstract
Where information on homologous proteins is available, progress is being made in automated prediction of protein function from sequence and structure. With many genomes now sequenced, computational annotation methods to characterize genes and proteins from their sequence are increasingly important. The BioSapiens Network has developed tools to address all stages of this process, and here we review progress in the automated prediction of protein function based on protein sequence and structure.
Affiliation(s)
- Yaniv Loewenstein
- Department of Biological Chemistry, The Hebrew University of Jerusalem, Sudarsky Center, Jerusalem 91904, Israel
|
49
|
Vizcaíno JA, Mueller M, Hermjakob H, Martens L. Charting online OMICS resources: A navigational chart for clinical researchers. Proteomics Clin Appl 2008; 3:18-29. [PMID: 21136933 DOI: 10.1002/prca.200800082] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 03/30/2008] [Indexed: 12/22/2022]
Abstract
The life sciences have sprouted several popular and successful OMICS technologies that span all levels of biological information transfer. Ever since the start of the Human Genome Project, the then revolutionary idea to make all resulting data publicly available has been central to all of the efforts across OMICS technologies. As a result, a great variety of publicly available data repositories and resources is currently available to the research community. This widespread availability of data does come at the price of increased confusion on the part of the users, especially for those that see the OMICS technologies as tools to help unravel a larger biological or clinical question. We therefore provide a comprehensive overview of the available resources across OMICS fields, with a special emphasis on those databases that are relevant to the study of proteins. Additionally, we also describe various integrative systems that have been established, and highlight new developments in the field that can revolutionize the way in which live data integration is achieved over the internet.
Affiliation(s)
- Juan Antonio Vizcaíno
- EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
|
50
|
Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res 2008; 37:D310-4. [PMID: 18996897 PMCID: PMC2686597 DOI: 10.1093/nar/gkn877] [Citation(s) in RCA: 157] [Impact Index Per Article: 9.2] [Indexed: 11/13/2022] Open
Abstract
The latest version of CATH (class, architecture, topology, homology) (version 3.2), released in July 2008 (http://www.cathdb.info), contains 114,215 domains, 2178 Homologous superfamilies and 1110 fold groups. We have assigned 20,330 new domains, 87 new homologous superfamilies and 26 new folds since CATH release version 3.1. A total of 28,064 new domains have been assigned since our NAR 2007 database publication (CATH version 3.0). The CATH website has been completely redesigned and includes more comprehensive documentation. We have revisited the CATH architecture level as part of the development of a 'Protein Chart' and present information on the population of each architecture. The CATHEDRAL structure comparison algorithm has been improved and used to characterize structural diversity in CATH superfamilies and structural overlaps between superfamilies. Although the majority of superfamilies in CATH are not structurally diverse and do not overlap significantly with other superfamilies, approximately 4% of superfamilies are very diverse and these are the superfamilies that are most highly populated in both the PDB and in the genomes. Information on the degree of structural diversity in each superfamily and structural overlaps between superfamilies can now be downloaded from the CATH website.
Affiliation(s)
- Alison L Cuff
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
|