1
Bordin N, Scholes H, Rauer C, Roca-Martínez J, Sillitoe I, Orengo C. Clustering protein functional families at large scale with hierarchical approaches. Protein Sci 2024; 33:e5140. [PMID: 39145441 PMCID: PMC11325189 DOI: 10.1002/pro.5140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/16/2024]
Abstract
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
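The abstract above describes forming relationship trees from distance matrices. As a rough illustration of that general idea only, the sketch below builds a merge tree by average-linkage agglomerative clustering in plain Python. It is not the CATH-eMMA implementation: the function name `average_linkage_tree`, the domain labels and the distance values are all hypothetical, and the linkage rule is an assumption chosen for simplicity.

```python
from itertools import combinations

def average_linkage_tree(names, dist):
    """Build a merge tree (nested tuples) from a symmetric distance matrix."""
    # Each active cluster maps id -> (subtree, list of member indices).
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)
    merges = []  # (subtree, mean distance at which it was formed)

    def mean_dist(a, b):
        # Average of all pairwise distances between members of two clusters.
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean pairwise distance.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: mean_dist(*p))
        d = mean_dist(a, b)
        subtree = (clusters[a][0], clusters[b][0])
        members = clusters[a][1] + clusters[b][1]
        del clusters[a], clusters[b]
        clusters[next_id] = (subtree, members)
        merges.append((subtree, d))
        next_id += 1

    (root, _), = clusters.values()
    return root, merges

# Hypothetical distances (e.g. derived from embeddings) for four toy domains.
names = ["d1", "d2", "d3", "d4"]
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.9, 1.0],
    [0.9, 0.9, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]
root, merges = average_linkage_tree(names, dist)
print(root)
```

This naive version recomputes cluster distances at every step, so it only suits small inputs; a large-scale tool would need a far more efficient scheme.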
Affiliation(s)
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Harry Scholes
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Clemens Rauer
  - Institute of Structural and Molecular Biology, University College London, London, UK
  - Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain
- Joel Roca-Martínez
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, UK
2
Bonello J, Orengo C. FunPredCATH: An ensemble method for predicting protein function using CATH. Biochim Biophys Acta Proteins Proteom 2024; 1872:140985. [PMID: 38122964 DOI: 10.1016/j.bbapap.2023.140985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 12/05/2023] [Accepted: 12/06/2023] [Indexed: 12/23/2023]
Abstract
MOTIVATION The growth of unannotated proteins in UniProt increases at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families that is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. 
We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
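The abstract mentions a set-based score built from set intersection and set union, and evaluation with the Fmax metric. The sketch below shows one common reading of each: a Jaccard index for the set score (an assumption, since the abstract does not give the exact formula) and a protein-centric Fmax in the style used by CAFA, where precision is averaged over proteins with at least one prediction at a threshold and recall over all benchmark proteins. All protein names, GO term labels and scores are made up for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    thresholds = thresholds or [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best

# Toy benchmark with hypothetical proteins and terms.
truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
predictions = {"P1": {"GO:a": 0.9, "GO:x": 0.4}, "P2": {"GO:c": 0.6}}
print(jaccard({"GO:a", "GO:b"}, {"GO:b", "GO:c"}))
print(fmax(predictions, truth))
```

In this toy case the best trade-off occurs at thresholds just above 0.4, where the low-scoring false positive for P1 is dropped while the correct prediction for P2 is still kept.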
Affiliation(s)
- Joseph Bonello
  - Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom
  - Department of Computer Information Systems, University of Malta, Faculty of ICT, Msida, MSD 2080, Malta
- Christine Orengo
  - Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom
3
Feldmeyer B, Bornberg-Bauer E, Dohmen E, Fouks B, Heckenhauer J, Huylmans AK, Jones ARC, Stolle E, Harrison MC. Comparative Evolutionary Genomics in Insects. Methods Mol Biol 2024; 2802:473-514. [PMID: 38819569 DOI: 10.1007/978-1-0716-3838-5_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Genome sequencing quality, in terms of both read length and accuracy, is constantly improving. By combining long-read sequencing technologies with various scaffolding techniques, chromosome-level genome assemblies are now achievable at an affordable price for non-model organisms. Insects represent an exciting taxon for studying the genomic underpinnings of evolutionary innovations, due to ancient origins, immense species-richness, and broad phenotypic diversity. Here we summarize some of the most important methods for carrying out a comparative genomics study on insects. We describe available tools and offer concrete tips on all stages of such an endeavor from DNA extraction through genome sequencing, annotation, and several evolutionary analyses. Along the way we describe important insect-specific aspects, such as DNA extraction difficulties or gene families that are particularly difficult to annotate, and offer solutions. We describe results from several examples of comparative genomics analyses on insects to illustrate the fascinating questions that can now be addressed in this new age of genomics research.
Affiliation(s)
- Barbara Feldmeyer
  - Senckenberg Biodiversity and Climate Research Centre (SBiK-F), Molecular Ecology, Frankfurt, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
  - Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Elias Dohmen
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Bertrand Fouks
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Jacqueline Heckenhauer
  - LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG), Frankfurt, Germany
  - Department of Terrestrial Zoology, Senckenberg Research Institute and Natural History Museum Frankfurt, Frankfurt, Germany
- Ann Kathrin Huylmans
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University, Mainz, Germany
- Alun R C Jones
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Eckart Stolle
  - Museum Koenig, Leibniz Institute for the Analysis of Biodiversity Change (LIB), Bonn, Germany
- Mark C Harrison
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
4
Quantitative In Silico Evaluation of Allergenic Proteins from Anacardium occidentale, Carya illinoinensis, Juglans regia and Pistacia vera and Their Epitopes as Precursors of Bioactive Peptides. Curr Issues Mol Biol 2022; 44:3100-3117. [PMID: 35877438 PMCID: PMC9317212 DOI: 10.3390/cimb44070214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Revised: 06/28/2022] [Accepted: 07/01/2022] [Indexed: 11/16/2022]
Abstract
The aim of the study presented here was to determine if there is a correlation between the presence of specific protein domains within tree nut allergens or tree nut allergen epitopes and the frequency of bioactive fragments and the predicted susceptibility to enzymatic digestion in allergenic proteins from tree nuts of cashew (Anacardium occidentale), pecan (Carya illinoinensis), English walnut (Juglans regia) and pistachio (Pistacia vera) plants. These bioactive peptides are distributed along the length of the protein and are not enriched in IgE epitope sequences. Classification of proteins as bioactive peptide precursors based on the presence of specific protein domains may be a promising approach. Proteins possessing a vicilin, N-terminal family domain, or napin domain contain a relatively low occurrence of bioactive fragments. In contrast, proteins possessing the cupin 1 domain without the vicilin N-terminal family domain contain a relatively high total frequency of bioactive fragments and predicted release of bioactive fragments by the joint action of pepsin, trypsin, and chymotrypsin. This approach could be utilized in food science to simplify the selection of protein domains enriched for bioactive peptides.
5
Abstract
Enzymes are predominantly proteins able to effectively and selectively catalyze highly complex biochemical reactions in mild reaction conditions. Nevertheless, they are limited to the arsenal of reactions that have emerged during natural evolution in compliance with their intrinsic nature, three-dimensional structures and dynamics. They optimally work in physiological conditions for a limited range of reactions, and thus exhibit a low tolerance for solvent and temperature conditions. The de novo design of synthetic highly stable enzymes able to catalyze a broad range of chemical reactions in variable conditions is a great challenge, which requires the development of programmable and finely tunable artificial tools. Interestingly, over the last two decades, chemists developed protein secondary structure mimics to achieve some desirable features of proteins, which are able to interfere with the biological processes. Such non-natural oligomers, so called foldamers, can adopt highly stable and predictable architectures and have extensively demonstrated their attractiveness for widespread applications in fields from biomedical to material science. Foldamer science was more recently considered to provide original solutions to the de novo design of artificial enzymes. This review covers recent developments related to peptidomimetic foldamers with catalytic properties and the principles that have guided their design.
6
Iyer MS, Joshi AG, Sowdhamini R. Genome-wide survey of remote homologues for protein domain superfamilies of known structure reveals unequal distribution across structural classes. Mol Omics 2018; 14:266-280. [PMID: 29971307 DOI: 10.1039/c8mo00008e] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/08/2023]
Abstract
Domains are the basic building blocks of proteins which can combine to give rise to different domain architectures. Annotation of domains in a sequence is the first step towards understanding the biological function. Since there are a limited number of folds and evolutionarily related proteins have a similar structure, function can be inferred through remote homology. Computational sequence searches were performed for remote homologues on genomes of ∼160 000 different organisms, starting from nearly 11 000 superfamily queries of known structure. Case studies revealed that most of the associated domains are involved in the same biological process. Using all the proteins predicted to have at least one structural domain, a coverage of 61% of Pfam families was achieved, which is higher than the existing methods (43.36% by SIFTS). Taxonomic analysis of the proteins revealed 493 superfamilies in all the major kingdoms of life and a few lateral gene transfers between viruses and cellular organisms. The distribution of remote homologues across different classes, folds and superfamilies was studied and reveals that sequences are unequally distributed across structural classes. Finally, domain architectures were computed for the homologues and these data were compiled for each superfamily and organism.
Affiliation(s)
- Meenakshi S Iyer
  - National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore, Karnataka 560 065, India
7
Kinjo AR. A unified statistical model of protein multiple sequence alignment integrating direct coupling and insertions. Biophys Physicobiol 2016; 13:45-62. [PMID: 27924257 PMCID: PMC5042171 DOI: 10.2142/biophysico.13.0_45] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 10/21/2015] [Accepted: 03/18/2016] [Indexed: 12/01/2022]
Abstract
The multiple sequence alignment (MSA) of a protein family provides a wealth of information in terms of the conservation pattern of amino acid residues not only at each alignment site but also between distant sites. In order to statistically model the MSA incorporating both short-range and long-range correlations as well as insertions, I have derived a lattice gas model of the MSA based on the principle of maximum entropy. The partition function, obtained by the transfer matrix method with a mean-field approximation, accounts for all possible alignments with all possible sequences. The model parameters for short-range and long-range interactions were determined by a self-consistent condition and by a Gaussian approximation, respectively. Using this model with and without long-range interactions, I analyzed the globin and V-set domains by increasing the “temperature” and by “mutating” a site. The correlations between residue conservation and various measures of the system’s stability indicate that the long-range interactions make the conservation pattern more specific to the structure, and increasingly stabilize better conserved residues.
Affiliation(s)
- Akira R Kinjo
  - Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
8
Minkiewicz P, Darewicz M, Iwaniak A, Sokołowska J, Starowicz P, Bucholska J, Hrynkiewicz M. Common Amino Acid Subsequences in a Universal Proteome--Relevance for Food Science. Int J Mol Sci 2015; 16:20748-73. [PMID: 26340620 PMCID: PMC4613229 DOI: 10.3390/ijms160920748] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 07/07/2015] [Revised: 08/18/2015] [Accepted: 08/24/2015] [Indexed: 02/06/2023]
Abstract
A common subsequence is a fragment of the amino acid chain that occurs in more than one protein. Common subsequences may be an object of interest for food scientists as biologically active peptides, epitopes, and/or protein markers that are used in comparative proteomics. An individual bioactive fragment, in particular the shortest fragment containing two or three amino acid residues, may occur in many protein sequences. An individual linear epitope may also be present in multiple sequences of precursor proteins. Although recent recommendations for prediction of allergenicity and cross-reactivity include not only sequence identity, but also similarities in secondary and tertiary structures surrounding the common fragment, local sequence identity may be used to screen protein sequence databases for potential allergens in silico. The main weakness of the screening process is that it overlooks allergens and cross-reactivity cases without identical fragments corresponding to linear epitopes. A single peptide may also serve as a marker of a group of allergens that belong to the same family and, possibly, reveal cross-reactivity. This review article discusses the benefits for food scientists that follow from the common subsequences concept.
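The definition at the start of this abstract (a fragment of the amino acid chain that occurs in more than one protein) can be made concrete with a short sketch. The function below lists every fragment of a fixed length that appears in at least two sequences. The function name `common_subsequences`, the protein labels and the toy sequences are hypothetical, and real screening would work against sequence databases rather than an in-memory dictionary.

```python
from collections import defaultdict

def common_subsequences(proteins, k):
    """Return {fragment: sorted protein names} for length-k fragments
    that occur in more than one protein."""
    owners = defaultdict(set)
    for name, seq in proteins.items():
        # Slide a window of length k along the sequence.
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    return {frag: sorted(names) for frag, names in owners.items() if len(names) > 1}

# Toy sequences, invented for illustration only.
proteins = {"A": "MKVLPP", "B": "GGVLPA", "C": "MKTT"}
shared = common_subsequences(proteins, 3)
print(shared)
```

With shorter fragments (for example `k=2`) many more fragments are shared, which matches the abstract's remark that the shortest bioactive fragments may occur in many protein sequences.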
Affiliation(s)
- Piotr Minkiewicz
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Małgorzata Darewicz
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Anna Iwaniak
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Jolanta Sokołowska
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Piotr Starowicz
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Justyna Bucholska
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
- Monika Hrynkiewicz
  - Department of Food Biochemistry, University of Warmia and Mazury in Olsztyn, Plac Cieszyński 1, Olsztyn-Kortowo 10-726, Poland
9
Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 2015; 31:3460-7. [PMID: 26139634 PMCID: PMC4612221 DOI: 10.1093/bioinformatics/btv398] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Received: 05/02/2015] [Accepted: 06/24/2015] [Indexed: 11/18/2022]
Abstract
Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate >16 million domain sequences. Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. Contact: sayoni.das.12@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sayoni Das
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Christine A Orengo
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
10
Piovesan D, Giollo M, Leonardi E, Ferrari C, Tosatto SCE. INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res 2015; 43:W134-40. [PMID: 26019177 PMCID: PMC4489281 DOI: 10.1093/nar/gkv523] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/01/2015] [Accepted: 05/07/2015] [Indexed: 01/10/2023]
Abstract
Identifying protein functions can be useful for numerous applications in biology. The prediction of gene ontology (GO) functional terms from sequence remains however a challenging task, as shown by the recent CAFA experiments. Here we present INGA, a web server developed to predict protein function from a combination of three orthogonal approaches. Sequence similarity and domain architecture searches are combined with protein-protein interaction network data to derive consensus predictions for GO terms using functional enrichment. The INGA server can be queried both programmatically through RESTful services and through a web interface designed for usability. The latter provides output supporting the GO term predictions with the annotating sequences. INGA is validated on the CAFA-1 data set and was recently shown to perform consistently well in the CAFA-2 blind test. The INGA web server is available from URL: http://protein.bio.unipd.it/inga.
Affiliation(s)
- Damiano Piovesan
  - Department of Biomedical Sciences, University of Padua, Padua 35121, Italy
- Manuel Giollo
  - Department of Biomedical Sciences, University of Padua, Padua 35121, Italy
  - Department of Information Engineering, University of Padua, Padua 35121, Italy
- Emanuela Leonardi
  - Department of Women's and Children's Health, University of Padua, Padua 35128, Italy
- Carlo Ferrari
  - Department of Information Engineering, University of Padua, Padua 35121, Italy
- Silvio C E Tosatto
  - Department of Biomedical Sciences, University of Padua, Padua 35121, Italy
  - CNR Institute of Neuroscience, Padua 35121, Italy
11
Lu Y, Lu Y, Deng J, Peng H, Lu H, Lu LJ. A novel essential domain perspective for exploring gene essentiality. Bioinformatics 2015; 31:2921-9. [PMID: 26002906 DOI: 10.1093/bioinformatics/btv312] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/11/2015] [Accepted: 05/13/2015] [Indexed: 02/05/2023]
Abstract
MOTIVATION Genes with indispensable functions are identified as essential; however, the traditional gene-level studies of essentiality have several limitations. In this study, we characterized gene essentiality from a new perspective of protein domains, the independent structural or functional units of a polypeptide chain. RESULTS To identify such essential domains, we have developed an Expectation-Maximization (EM) algorithm-based Essential Domain Prediction (EDP) Model. With simulated datasets, the model provided convergent results given different initial values and offered accurate predictions even with noise. We then applied the EDP model to six microbial species and predicted 1879 domains to be essential in at least one species, ranging 10-23% in each species. The predicted essential domains were more conserved than either non-essential domains or essential genes. Comparing essential domains in prokaryotes and eukaryotes revealed an evolutionary distance consistent with that inferred from ribosomal RNA. When utilizing these essential domains to reproduce the annotation of essential genes, we received accurate results that suggest protein domains are more basic units for the essentiality of genes. Furthermore, we presented several examples to illustrate how the combination of essential and non-essential domains can lead to genes with divergent essentiality. In summary, we have described the first systematic analysis on gene essentiality on the level of domains. CONTACT huilu.bioinfo@gmail.com or Long.Lu@cchmc.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yao Lu
  - Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China
- Yulan Lu
  - State Key Laboratory of Genetic Engineering Institute of Biostatistics, School of Life Science, Fudan University, Shanghai 200433, People's Republic of China
- Jingyuan Deng
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Hai Peng
  - Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
- Hui Lu
  - Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China
  - Department of Bioengineering (MC 063), University of Illinois at Chicago, Chicago, IL 60607-7052, USA
  - Collaborative Innovation Center for Biotherapy, West China Hospital, Sichuan University, Chengdu, China
- Long Jason Lu
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
  - Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
12
Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, Orengo CA. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 2015; 43:W148-53. [PMID: 25964299 PMCID: PMC4489299 DOI: 10.1093/nar/gkv488] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 02/14/2015] [Accepted: 05/02/2015] [Indexed: 12/20/2022]
Abstract
The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.
Affiliation(s)
- Sayoni Das
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- John Ward
  - Department of Biochemical Engineering, UCL, Gower Street, WC1E 6BT, UK
- Christine A Orengo
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
13
Hunter PJ, de Bono B. Biophysical constraints on the evolution of tissue structure and function. J Physiol 2015; 592:2389-401. [PMID: 24882821 DOI: 10.1113/jphysiol.2014.273235] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 01/06/2023]
Abstract
Phylogenetic analyses based on models of molecular sequence evolution have driven to industrial scale the generation, cataloguing and modelling of nucleic acid and polypeptide structure. The recent application of these techniques to study the evolution of protein interaction networks extends this analytical rigour to the study of nucleic acid and protein function. Can we further extend phylogenetic analysis of protein networks to the study of tissue structure and function? If the study of tissue phylogeny is to join up with mainstream efforts in the molecular evolution domain, the continuum field description of tissue biophysics must be linked to discrete descriptions of molecular biochemistry. In support of this goal we discuss tissue units, and biophysical constraints to molecular function associated with these units, to present a rationale with which to model tissue evolution. Our rationale combines a multiscale hierarchy of functional tissue units (FTUs) with the corresponding application of physical laws to describe molecular interaction networks and flow processes over continuum fields within these units. Non-dimensional numbers, derived from the equations governing biophysical processes in FTUs, are proposed as metrics for comparative studies across individuals, species or evolutionary time. We also outline the challenges inherent to the systematic cataloguing and phylogenetic analysis of tissue features relevant to the maintenance and regulation of molecular interaction networks. These features are key to understanding the core biophysical constraints on tissue evolution.
Affiliation(s)
- P J Hunter
  - Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
  - University of Oxford, Oxford, UK
- B de Bono
  - Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
  - University College London, London, UK
14
Shi JY, Yiu SM, Zhang YN, Chin FYL. Effective moment feature vectors for protein domain structures. PLoS One 2014; 8:e83788. [PMID: 24391828 DOI: 10.1371/journal.pone.0083788] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/02/2013] [Accepted: 11/08/2013] [Indexed: 11/19/2022]
Abstract
Image processing techniques have been shown to be useful in studying protein domain structures. The idea is to represent the pairwise distances of any two residues of the structure in a 2D distance matrix (DM). Features and/or submatrices are extracted from this DM to represent a domain. Existing approaches, however, may involve a large number of features (100-400) or complicated mathematical operations. Finding fewer but more effective features is always desirable. In this paper, based on some key observations on DMs, we are able to decompose a DM image into four basic binary images, each representing the structural characteristics of a fundamental secondary structure element (SSE) or a motif in the domain. Using the concept of moments in image processing, we further derive 45 structural features based on the four binary images. Together with 4 features extracted from the basic images, we represent the structure of a domain using 49 features. We show that our feature vectors can represent domain structures effectively in terms of the following. (1) We show a higher accuracy for domain classification. (2) We show a clear and consistent distribution of domains using our proposed structural vector space. (3) We are able to cluster the domains according to our moment features and demonstrate a relationship between structural variation and functional diversity.
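To make the ingredients named in this abstract concrete, the sketch below computes a residue distance matrix, thresholds it into a binary image, and evaluates raw image moments M_pq = Σ x^p y^q I(x, y). It does not reproduce the paper's four-image decomposition or its 45 moment features: the single distance cutoff, the three toy coordinates and the function names are all assumptions made for illustration.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def binarize(dm, cutoff):
    """Binary image: 1 where the distance is within the cutoff, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in dm]

def raw_moment(image, p, q):
    """Raw image moment M_pq, with x as column index and y as row index."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )

# Three hypothetical residue positions (arbitrary units).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
dm = distance_matrix(coords)
image = binarize(dm, cutoff=4.0)

m00 = raw_moment(image, 0, 0)  # number of "on" pixels
centroid = (raw_moment(image, 1, 0) / m00, raw_moment(image, 0, 1) / m00)
print(image, m00, centroid)
```

Low-order moments such as the pixel count and centroid summarise a binary image in a handful of numbers, which is the general motivation for moment-based feature vectors.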
Affiliation(s)
- Jian-Yu Shi
- School of Life Science, Northwestern Polytechnical University, Xi'an, Shaanxi Province, China ; Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Siu-Ming Yiu
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Yan-Ning Zhang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi Province, China
|
15
|
Arumugam G, Nair AG, Hariharaputran S, Ramanathan S. Rebelling for a reason: protein structural "outliers". PLoS One 2013; 8:e74416. [PMID: 24073209 PMCID: PMC3779223 DOI: 10.1371/journal.pone.0074416] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/27/2013] [Accepted: 07/31/2013] [Indexed: 11/29/2022] Open
Abstract
Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or 'rebels', are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities.
Affiliation(s)
- Gandhimathi Arumugam
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vigyana Kendra Campus, Bangalore, India
| | - Anu G. Nair
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vigyana Kendra Campus, Bangalore, India
| | - Sridhar Hariharaputran
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vigyana Kendra Campus, Bangalore, India
| | - Sowdhamini Ramanathan
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vigyana Kendra Campus, Bangalore, India
|
16
|
Sadowski MI. Prediction of protein domain boundaries from inverse covariances. Proteins 2013; 81:253-60. [PMID: 22987736 PMCID: PMC3563215 DOI: 10.1002/prot.24181] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 05/23/2012] [Revised: 08/10/2012] [Accepted: 09/04/2012] [Indexed: 01/04/2023]
Abstract
It has been known even since relatively few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and play the role of reusable functional modules found in many contexts. In many structural biology tasks, in particular structure prediction, it is of great use to be able to identify domains within the structure and analyze these regions separately. However, when using sequence data alone this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information including a kernel smoothing-based approach and methods based on building alpha-carbon models and compare performance with a length-based predictor, a homology search method and four published sequence-based predictors: DOMCUT, DomPRO, DLP-SVM, and SCOOBY-DOmain. We show that the kernel-smoothing method is significantly better than the other ab initio predictors when both single-domain and multidomain targets are considered and is not significantly different to the homology-based method. Considering only multidomain targets the kernel-smoothing method outperforms all of the published methods except DLP-SVM. The kernel smoothing method therefore represents a potentially useful improvement to ab initio domain prediction.
Affiliation(s)
- Michael I Sadowski
- MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London, United Kingdom.
|
17
|
Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J 2013; 449:581-94. [DOI: 10.1042/bj20121221] [Citation(s) in RCA: 131] [Impact Index Per Article: 11.9] [Indexed: 12/29/2022]
Abstract
The present review focuses on the evolution of proteins and the impact of amino acid mutations on function from a structural perspective. Proteins evolve under the law of natural selection and undergo alternating periods of conservative evolution and of relatively rapid change. The likelihood of mutations being fixed in the genome depends on various factors, such as the fitness of the phenotype or the position of the residues in the three-dimensional structure. For example, co-evolution of residues located close together in three-dimensional space can occur to preserve global stability. Whereas point mutations can fine-tune the protein function, residue insertions and deletions (‘decorations’ at the structural level) can sometimes modify functional sites and protein interactions more dramatically. We discuss recent developments and tools to identify such episodic mutations, and examine their applications in medical research. Such tools have been tested on simulated data and applied to real data such as viruses or animal sequences. Traditionally, there has been little if any cross-talk between the fields of protein biophysics, protein structure–function and molecular evolution. However, the last several years have seen some exciting developments in combining these approaches to obtain an in-depth understanding of how proteins evolve. For example, a better understanding of how structural constraints affect protein evolution will greatly help us to optimize our models of sequence evolution. The present review explores this new synthesis of perspectives.
|
18
|
Joshi AG, Raghavender US, Sowdhamini R. Improved performance of sequence search approaches in remote homology detection. F1000Res 2013; 2:93. [PMID: 25469226 PMCID: PMC4240247 DOI: 10.12688/f1000research.2-93.v2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 06/27/2014] [Indexed: 11/20/2022] Open
Abstract
The protein sequence space is vast and diverse, spanning across different families. Biologically meaningful relationships exist between proteins at superfamily level. However, it is highly challenging to establish convincing relationships at the superfamily level by means of simple sequence searches. It is necessary to design a rigorous sequence search strategy to establish remote homology relationships and achieve high coverage. We have used iterative profile-based methods, along with constraints of sequence motifs, to specify search directions. We address the importance of multiple start points (queries) to achieve high coverage at protein superfamily level. We have devised strategies to employ a structural regime to search sequence space with good specificity and sensitivity. We employ two well-known sequence search methods, PSI-BLAST and PHI-BLAST, with multiple queries and multiple patterns to enhance homologue identification at the structural superfamily level. The study suggests that multiple queries improve sensitivity, while a pattern-constrained iterative sequence search becomes stringent at the initial stages, thereby driving the search in a specific direction and also achieves high coverage. This data mining approach has been applied to the entire structural superfamily database.
Affiliation(s)
- Adwait Govind Joshi
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India ; Manipal University, Manipal, Karnataka, 576104, India
| | - Upadhyayula Surya Raghavender
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India
| | - Ramanathan Sowdhamini
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India
|
19
|
Minkiewicz P, Bucholska J, Darewicz M, Borawska J. Epitopic hexapeptide sequences from Baltic cod parvalbumin beta (allergen Gad c 1) are common in the universal proteome. Peptides 2012; 38:105-9. [PMID: 22940202 DOI: 10.1016/j.peptides.2012.08.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/18/2012] [Revised: 08/14/2012] [Accepted: 08/14/2012] [Indexed: 01/25/2023]
Abstract
The aim of this study was to analyze the distribution of hexapeptide fragments considered as epitopes of Baltic cod parvalbumin beta (allergen Gad c 1) in the universal proteome. Cod (Gadus morhua subsp. callarias) parvalbumin hexapeptides cataloged in the Immune Epitope Database were used as query sequences. The UniProt database was screened using the WU-BLAST 2 program. The distribution of hexapeptide fragments was investigated in various protein families, classified according to the presence of the appropriate domains, and in proteins of plant, animal and microbial species. Hexapeptides from cod parvalbumin were found in the proteins of plants and animals which are food sources, microorganisms with various applications in food technology and biotechnology, microorganisms which are human symbionts and commensals as well as human pathogens. In the last case possible coverage between epitopes from pathogens and allergens should be avoided during vaccine design.
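The idea of screening a set of proteins for hexapeptide fragments can be sketched with a simple sliding-window scan. Note that the study itself used the WU-BLAST 2 program against UniProt; the exact-substring search below is a deliberately simpler stand-in, and the function names and toy sequences are hypothetical.

```python
def kmers(sequence, k=6):
    """All overlapping length-k windows of a sequence (hexapeptides when k=6)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def find_shared_kmers(query, targets, k=6):
    """Map each query k-mer to the names of target sequences that contain it exactly."""
    hits = {}
    for peptide in kmers(query, k):
        matched = [name for name, seq in targets.items() if peptide in seq]
        if matched:
            hits[peptide] = matched
    return hits

# Toy sequences for illustration; these are not real parvalbumin or UniProt entries.
query = "MKTAYIAK"
targets = {"protein_1": "GGKTAYIAGG", "protein_2": "MMMMMMMM"}
hits = find_shared_kmers(query, targets)
```

An exact scan like this finds only identical hexapeptides; a BLAST-style search can also report near matches, so the two approaches are not interchangeable.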
Affiliation(s)
- Piotr Minkiewicz
- University of Warmia and Mazury in Olsztyn, Chair of Food Biochemistry, Olsztyn-Kortowo, Poland.
|
20
|
Searls DB. A primer in macromolecular linguistics. Biopolymers 2012; 99:203-17. [PMID: 23034580 DOI: 10.1002/bip.22101] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/26/2012] [Accepted: 05/25/2012] [Indexed: 01/01/2023]
Abstract
Polymeric macromolecules, when viewed abstractly as strings of symbols, can be treated in terms of formal language theory, providing a mathematical foundation for characterizing such strings both as collections and in terms of their individual structures. In addition this approach offers a framework for analysis of macromolecules by tools and conventions widely used in computational linguistics. This article introduces the ways that linguistics can be and has been applied to molecular biology, covering the relevant formal language theory at a relatively nontechnical level. Analogies between macromolecules and human natural language are used to provide intuitive insights into the relevance of grammars, parsing, and analysis of language complexity to biology.
|
21
|
Alvarez MA, Yan C. A new protein graph model for function prediction. Comput Biol Chem 2012; 37:6-10. [PMID: 22381922 DOI: 10.1016/j.compbiolchem.2012.01.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/11/2011] [Revised: 01/02/2012] [Accepted: 01/04/2012] [Indexed: 11/27/2022]
Abstract
As several structural proteomic projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein functions from protein structures are in urgent need. In this paper, we present a method to explore the clustering patterns of amino acids on the 3-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented using a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.
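The first step described in this abstract, grouping residues into spatial clusters by hierarchical agglomerative clustering, can be sketched as follows. The abstract does not state the linkage criterion or the stopping rule, so single linkage and a fixed target number of clusters are assumptions here, as are the function name and the toy coordinates.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Cluster-to-cluster distance is the minimum point-to-point distance
    (single linkage, an assumption; the abstract does not name a linkage).
    """
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy residue coordinates: two well-separated spatial groups.
residues = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0),
            (11.0, 0.0, 0.0), (0.5, 1.0, 0.0)]
groups = single_linkage(residues, n_clusters=2)
```

In the graph model described above, each resulting group would become one node; the node labelling with evolutionary profiles and the shortest-path graph kernel are outside the scope of this sketch.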
Affiliation(s)
- Marco A Alvarez
- Department of Computer Science, Utah State University, Logan, UT 84322, USA
|
22
|
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Laskowski RA, Orengo CA, Thornton JM. Exploring the evolution of novel enzyme functions within structurally defined protein superfamilies. PLoS Comput Biol 2012; 8:e1002403. [PMID: 22396634 PMCID: PMC3291543 DOI: 10.1371/journal.pcbi.1002403] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Received: 09/30/2011] [Accepted: 01/09/2012] [Indexed: 11/18/2022] Open
Abstract
In order to understand the evolution of enzyme reactions and to gain an overview of biological catalysis we have combined sequence and structural data to generate phylogenetic trees in an analysis of 276 structurally defined enzyme superfamilies, and used these to study how enzyme functions have evolved. We describe in detail the analysis of two superfamilies to illustrate different paradigms of enzyme evolution. Gathering together data from all the superfamilies supports and develops the observation that they have all evolved to act on a diverse set of substrates, whilst the evolution of new chemistry is much less common. Despite that, by bringing together so much data, we can provide a comprehensive overview of the most common and rare types of changes in function. Our analysis demonstrates on a larger scale than previously studied, that modifications in overall chemistry still occur, with all possible changes at the primary level of the Enzyme Commission (E.C.) classification observed to a greater or lesser extent. The phylogenetic trees map out the evolutionary route taken within a superfamily, as well as all the possible changes within a superfamily. This has been used to generate a matrix of observed exchanges from one enzyme function to another, revealing the scale and nature of enzyme evolution and that some types of exchanges between and within E.C. classes are more prevalent than others. Surprisingly a large proportion (71%) of all known enzyme functions are performed by this relatively small set of 276 superfamilies. This reinforces the hypothesis that relatively few ancient enzymatic domain superfamilies were progenitors for most of the chemistry required for life. Enzymes, as biological catalysts, are crucial to life. Understanding how enzymes have evolved to perform the wide variety of reactions found across all kingdoms of life is fundamental to a broad range of biological studies, especially those leading to new therapeutics. 
To unravel the evolution of novel enzyme function requires combining information on protein structure, sequence, phylogeny and chemistry (in terms of interacting small molecules and reaction mechanisms). We have developed a protocol for integrating this wide range of data, which we have applied to a relatively large number of families comprising some very diverse relatives. This has permitted us to present an initial overview of the evolution of novel enzyme functions, in which we observe that some changes in function between relatives are more common than others, with most of the functionality observed in nature confined to relatively few families. Moreover, we are able to identify the evolutionary route taken within a superfamily to change the enzyme function from one reaction to another. This information may help in predicting the function of an enzyme that has yet to be experimentally characterised as well as in designing new enzymes for industrial and medical purposes.
Affiliation(s)
- Nicholas Furnham
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
|
23
|
Kinjo AR, Nakamura H. Composite structural motifs of binding sites for delineating biological functions of proteins. PLoS One 2012; 7:e31437. [PMID: 22347478 PMCID: PMC3275580 DOI: 10.1371/journal.pone.0031437] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/14/2011] [Accepted: 01/08/2012] [Indexed: 11/19/2022] Open
Abstract
Most biological processes are described as a series of interactions between proteins and other molecules, and interactions are in turn described in terms of atomic structures. To annotate protein functions as sets of interaction states at atomic resolution, and thereby to better understand the relation between protein interactions and biological functions, we conducted exhaustive all-against-all atomic structure comparisons of all known binding sites for ligands including small molecules, proteins and nucleic acids, and identified recurring elementary motifs. By integrating the elementary motifs associated with each subunit, we defined composite motifs that represent context-dependent combinations of elementary motifs. It is demonstrated that function similarity can be better inferred from composite motif similarity compared to the similarity of protein sequences or of individual binding sites. By integrating the composite motifs associated with each protein function, we define meta-composite motifs each of which is regarded as a time-independent diagrammatic representation of a biological process. It is shown that meta-composite motifs provide richer annotations of biological processes than sequence clusters. The present results serve as a basis for bridging atomic structures to higher-order biological phenomena by classification and integration of binding site structures.
Affiliation(s)
- Akira R Kinjo
- Institute for Protein Research, Osaka University, Suita, Osaka, Japan.
|
24
|
Mavridis L, Ghoorah AW, Venkatraman V, Ritchie DW. Representing and comparing protein folds and fold families using three-dimensional shape-density representations. Proteins 2011; 80:530-45. [DOI: 10.1002/prot.23218] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/28/2011] [Revised: 09/02/2011] [Accepted: 09/04/2011] [Indexed: 11/11/2022]
|
25
|
Practical applications of structural genomics technologies for mutagen research. Mutat Res 2011; 722:165-70. [PMID: 21182983 DOI: 10.1016/j.mrgentox.2010.12.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2010] [Accepted: 12/10/2010] [Indexed: 11/23/2022]
Abstract
Here we present a perspective on a range of practical uses of structural genomics for mutagen research. Structural genomics is an overloaded term and requires some definition to bound the discussion; we give a brief description of public and private structural genomics endeavors, along with some of their objectives, their activities, their capabilities, and their limitations. We discuss how structural genomics might impact mutagen research in three different scenarios: at a structural genomics center, at a lab with modest resources that also conducts structural biology research, and at a lab that is conducting mutagen research without in-house experimental structural biology. Applications span functional annotation of single genes or SNPs, to constructing gene networks and pathways, to an integrated systems biology approach. Structural genomics centers can take advantage of systems biology models to target high value targets for structure determination and in turn extend systems models to better understand systems biology diseases or phenomena. Individual investigator-run structural biology laboratories can collaborate with structural genomics centers, but can also take advantage of technical advances and tools developed by structural genomics centers and can employ a structural genomics approach to advancing biological understanding. Individual investigator-run non-structural biology laboratories can also collaborate with structural genomics centers, possibly influencing targeting decisions, but can also use structure based annotation tools enabled by the growing coverage of protein fold space provided by structural genomics. Better functional annotation can inform pathway and systems biology models.
|
26
|
Protein disorder--a breakthrough invention of evolution? Curr Opin Struct Biol 2011; 21:412-8. [PMID: 21514145 DOI: 10.1016/j.sbi.2011.03.014] [Citation(s) in RCA: 112] [Impact Index Per Article: 8.6] [Received: 02/14/2011] [Revised: 03/29/2011] [Accepted: 03/29/2011] [Indexed: 11/21/2022]
Abstract
As an operational definition, we refer to regions in proteins that do not adopt regular three-dimensional structures in isolation, as disordered regions. An antipode to disorder would be 'well-structured' rather than 'ordered'. Here, we argue for the following three hypotheses. Firstly, it is more useful to picture disorder as a distinct phenomenon in structural biology than as an extreme example of protein flexibility. Secondly, there are many very different flavors of protein disorder, nevertheless, it seems advantageous to portray the universe of all possible proteins in terms of two main types: well-structured, disordered. There might be a third type 'other' but we have so far no positive evidence for this. Thirdly, nature uses protein disorder as a tool to adapt to different environments. Protein disorder is evolutionarily conserved and this maintenance of disorder is highly nontrivial. Increasingly integrating protein disorder into the toolbox of a living cell was a crucial step in the evolution from simple bacteria to complex eukaryotes. We need new advanced computational methods to study this new milestone in the advance of protein biology.
|
27
|
Dessailly BH, Redfern OC, Cuff AL, Orengo CA. Detailed analysis of function divergence in a large and diverse domain superfamily: toward a refined protocol of function classification. Structure 2011; 18:1522-35. [PMID: 21070951 DOI: 10.1016/j.str.2010.08.017] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 01/19/2010] [Revised: 08/06/2010] [Accepted: 08/13/2010] [Indexed: 10/18/2022]
Abstract
Some superfamilies contain large numbers of protein domains with very different functions. The ability to refine the functional classification of domains within these superfamilies is necessary for better understanding the evolution of functions and to guide function prediction of new relatives. To achieve this, a suitable starting point is the detailed analysis of functional divisions and mechanisms of functional divergence in a single superfamily. Here, we present such a detailed analysis in the superfamily of HUP domains. A biologically meaningful functional classification of HUP domains is obtained manually. Mechanisms of function diversification are investigated in detail using this classification. We observe that structural motifs play an important role in shaping broad functional divergence, whereas residue-level changes shape diversity at a more specific level. In parallel we examine the ability of an automated protocol to capture the biologically meaningful classification, with a view to automatically extending this classification in the future.
Affiliation(s)
- Benoit H Dessailly
- Department of Structural and Molecular Biology, University College of London, Gower Street, London WC1E6BT, UK.
|
28
|
The challenge of annotating protein sequences: The tale of eight domains of unknown function in Pfam. Comput Biol Chem 2010; 34:210-4. [PMID: 20537955 DOI: 10.1016/j.compbiolchem.2010.04.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 01/11/2010] [Revised: 04/09/2010] [Accepted: 04/25/2010] [Indexed: 11/21/2022]
Abstract
The Pfam database is an important tool in genome annotation, since it provides a collection of curated protein families. However, a subset of these families, known as domains of unknown function (DUFs), remains poorly characterized. We have related sequences from DUF404, DUF407, DUF482, DUF608, DUF810, DUF853, DUF976 and DUF1111 to homologs in PDB, within the midnight zone (9-20%) of sequence identity. These relationships were extended to provide functional annotation by sequence analysis and model building. Also described are examples of residue plasticity within enzyme active sites, and change of function within homologous sequences of a DUF.
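To make the "midnight zone (9-20%)" of sequence identity concrete, the sketch below computes percent identity for a pair of already-aligned sequences and checks whether it falls in that range. The choice of denominator (columns where neither sequence has a gap) is one of several conventions and is an assumption here, as are the function names and the toy alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def in_midnight_zone(identity, low=9.0, high=20.0):
    """True if identity lies in the 9-20% range quoted in the abstract (inclusive)."""
    return low <= identity <= high

# Toy alignment, not taken from any DUF family or PDB entry.
identity = percent_identity("ACDEFGHIKL-M", "ACWWWWWWWWQ-")
remote = in_midnight_zone(identity)
```

Here 2 of the 10 ungapped columns match, giving 20% identity, which sits at the upper edge of the quoted range.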
|
29
|
The evolution of protein functions and networks: a family-centric approach. Biochem Soc Trans 2009; 37:745-50. [PMID: 19614587 DOI: 10.1042/bst0370745] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
The study of superfamilies of protein domains using a combination of structure, sequence and function data provides insights into deep evolutionary history. In the present paper, analyses of functional diversity within such superfamilies as defined in the CATH-Gene3D resource are described. These analyses focus on structure-function relationships in very large and diverse superfamilies, and on the evolution of domain superfamily members in protein-protein complexes.
|
30
|
Jain P, Hirst JD. Exploring protein structural dissimilarity to facilitate structure classification. BMC Struct Biol 2009; 9:60. [PMID: 19765314 PMCID: PMC2754988 DOI: 10.1186/1472-6807-9-60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2009] [Accepted: 09/19/2009] [Indexed: 12/04/2022]
Abstract
BACKGROUND: Classification of newly resolved protein structures is important in understanding their architectural, evolutionary and functional relatedness to known protein structures. Among various efforts to improve the database of Structural Classification of Proteins (SCOP), automation has received particular attention. Herein, we predict the deepest SCOP structural level that an unclassified protein shares with classified proteins with an equal number of secondary structure elements (SSEs).
RESULTS: We compute a coefficient of dissimilarity (Omega) between proteins, based on structural and sequence-based descriptors characterising the respective constituent SSEs. For a set of 1,661 pairs of proteins with sequence identity up to 35%, the performance of Omega in predicting shared Class, Fold and Super-family levels is comparable to that of DaliLite Z score and shows a greater than four-fold increase in the true positive rate (TPR) for proteins sharing the Family level. On a larger set of 600 domains representing 200 families, the performance of Z score improves in predicting a shared Family, but still only achieves about half of the TPR of Omega. The TPR for structures sharing a Super-family is lower than in the first dataset, but Omega performs slightly better than Z score. Overall, the sensitivity of Omega in predicting common Fold level is higher than that of the DaliLite Z score.
CONCLUSION: Classification to a deeper level in the hierarchy is specific and difficult. So the efficiency of Omega may be attractive to the curators and the end-users of SCOP. We suggest Omega may be a better measure for structure classification than the DaliLite Z score, with the caveat that currently we are restricted to comparing structures with equal number of SSEs.
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | - Jonathan D Hirst
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
|