1
Abstract
The use of macromolecular structures is widespread for a variety of applications, from teaching protein structure principles all the way to ligand optimization in drug development. Applying data mining techniques on these experimentally determined structures requires a highly uniform, standardized structural data source. The Protein Data Bank (PDB) has evolved over the years toward becoming the standard resource for macromolecular structures. However, the process of selecting the data most suitable for specific applications is still very much based on personal preferences and understanding of the experimental techniques used to obtain these models. In this chapter, we will first explain the challenges with data standardization, annotation, and uniformity in the PDB entries determined by X-ray crystallography. We then discuss the specific effect that crystallographic data quality and model optimization methods have on structural models and how validation tools can be used to make informed choices. We also discuss specific advantages of using the PDB_REDO databank as a resource for structural data. Finally, we will provide guidelines on how to select the most suitable protein structure models for detailed analysis and how to select a set of structure models suitable for data mining.
Affiliation(s)
- Bart van Beusekom
- Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
- Anastassis Perrakis
- Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
- Robbie P Joosten
- Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands.
2
Touw WG, Baakman C, Black J, te Beek TAH, Krieger E, Joosten RP, Vriend G. A series of PDB-related databanks for everyday needs. Nucleic Acids Res 2014; 43:D364-8. [PMID: 25352545 PMCID: PMC4383885 DOI: 10.1093/nar/gku1028] [Citation(s) in RCA: 623] [Impact Index Per Article: 62.3] [Indexed: 12/02/2022]
Abstract
We present a series of databanks (http://swift.cmbi.ru.nl/gv/facilities/) that hold information that is computationally derived from Protein Data Bank (PDB) entries and that might augment macromolecular structure studies. These derived databanks run parallel to the PDB, i.e. they have one entry per PDB entry. Several of the well-established databanks such as HSSP, PDBREPORT and PDB_REDO have been updated and/or improved. The software that creates the DSSP databank, for example, has been rewritten to better cope with π-helices. A large number of databanks have been added to aid computational structural biology; some examples are lists of residues that make crystal contacts, lists of contacting residues using a series of contact definitions or lists of residue accessibilities. PDB files are not the optimal presentation of the underlying data for many studies. We therefore made a series of databanks that hold PDB files in an easier to use or more consistent representation. The BDB databank holds X-ray PDB files with consistently represented B-factors. We also added several visualization tools to aid the users of our databanks.
Affiliation(s)
- Wouter G Touw
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Coos Baakman
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Jon Black
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Tim A H te Beek
- Bio-Prodict BV, Nieuwe Marktstraat 54E, 6511 AA Nijmegen, The Netherlands
- E Krieger
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Robbie P Joosten
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, Amsterdam 1066 CX, The Netherlands
- Gert Vriend
- Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
3
Dey S, Pal A, Guharoy M, Sonavane S, Chakrabarti P. Characterization and prediction of the binding site in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation and structural parameters. Nucleic Acids Res 2012; 40:7150-61. [PMID: 22641851 PMCID: PMC3424558 DOI: 10.1093/nar/gks405] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/25/2022]
Abstract
We present a set of four parameters that in combination can predict DNA-binding residues on protein structures to a high degree of accuracy. These are the number of evolutionarily conserved residues (Ncons) and their spatial clustering (ρe), hydrogen bond donor capability (Dp) and residue propensity (Rp). We first used these parameters to characterize 130 interfaces in a set of 126 DNA-binding proteins (DBPs). The applicability of these parameters, both individually and in combination, to distinguish the true binding region from the rest of the protein surface was then analyzed. Rp shows the best performance, identifying the true interface with the top rank in 83% of cases. Importantly, we also used the unbound-bound test cases of the protein–DNA docking benchmark to test the efficacy of our method. When applied to the unbound form of the DBPs, Rp can distinguish 86% of cases. Finally, we have applied the SVM approach for recognizing the interface region using the above parameters along with the individual amino acid composition as attributes. The accuracy of prediction is 90.5% for the bound structures and 93.6% for the unbound form of the proteins.
Affiliation(s)
- Sucharita Dey
- Bioinformatics Centre, Bose Institute, P-1/12 CIT Scheme VIIM, Kolkata 700 054, India
4
Juritz E, Palopoli N, Fornasari MS, Fernandez-Alberti S, Parisi G. Protein Conformational Diversity Modulates Sequence Divergence. Mol Biol Evol 2012; 30:79-87. [DOI: 10.1093/molbev/mss080] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Indexed: 02/06/2023]
5
Rational design of DNA sequence-specific zinc fingers. FEBS Lett 2012; 586:918-23. [DOI: 10.1016/j.febslet.2012.02.025] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/07/2011] [Revised: 02/10/2012] [Accepted: 02/15/2012] [Indexed: 11/22/2022]
6
Satagopam VP, Theodoropoulou MC, Stampolakis CK, Pavlopoulos GA, Papandreou NC, Bagos PG, Schneider R, Hamodrakas SJ. GPCRs, G-proteins, effectors and their interactions: human-gpDB, a database employing visualization tools and data integration techniques. Database: The Journal of Biological Databases and Curation 2010; 2010:baq019. [PMID: 20689020 PMCID: PMC2931634 DOI: 10.1093/database/baq019] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 11/15/2022]
Abstract
G-protein coupled receptors (GPCRs) are a major family of membrane receptors in eukaryotic cells. They play a crucial role in the communication of a cell with the environment. Ligands bind to GPCRs on the outside of the cell, activating them by causing a conformational change, and allowing them to bind to G-proteins. Through their interaction with G-proteins, several effector molecules are activated, leading to many kinds of cellular and physiological responses. The great importance of GPCRs and their corresponding signal transduction pathways is indicated by the fact that they take part in many diverse disease processes and that a large part of the effort towards drug development today is focused on them. We present Human-gpDB, a database which currently holds information about 713 human GPCRs, 36 human G-proteins and 99 human effectors. The collection of information about the interactions between these molecules was done manually and the current version of Human-gpDB holds information for about 1663 connections between GPCRs and G-proteins and 1618 connections between G-proteins and effectors. Major advantages of Human-gpDB are the integration of several external data sources and the support of advanced visualization techniques. Human-gpDB is a simple yet powerful tool for researchers in the life sciences field as it integrates an up-to-date, carefully curated collection of human GPCRs, G-proteins, effectors and their interactions. The database may be a reference guide for medical and pharmaceutical research, especially in the areas of understanding human diseases and chemical and drug discovery. Database URLs: http://schneider.embl.de/human_gpdb; http://bioinformatics.biol.uoa.gr/human_gpdb/
Affiliation(s)
- Venkata P Satagopam
- Structural and Computational Biology Unit, EMBL, Meyerhofstrasse 1, Heidelberg D69117, Germany
7
Dey S, Pal A, Chakrabarti P, Janin J. The subunit interfaces of weakly associated homodimeric proteins. J Mol Biol 2010; 398:146-60. [PMID: 20156457 DOI: 10.1016/j.jmb.2010.02.020] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Received: 09/16/2009] [Revised: 02/10/2010] [Accepted: 02/10/2010] [Indexed: 02/07/2023]
Abstract
We analyzed subunit interfaces in 315 homodimers with an X-ray structure in the Protein Data Bank, validated by checking the literature for data that indicate that the proteins are dimeric in solution and that, in the case of the "weak" dimers, the homodimer is in equilibrium with the monomer. The interfaces of the 42 weak dimers, which are smaller by a factor of 2.4 on average than in the remainder of the set, are comparable in size with antibody-antigen or protease-inhibitor interfaces. Nevertheless, they are more hydrophobic than in the average transient protein-protein complex and similar in amino acid composition to the other homodimer interfaces. The mean numbers of interface hydrogen bonds and hydration water molecules per unit area are also similar in homodimers and transient complexes. Parameters related to the atomic packing suggest that many of the weak dimer interfaces are loosely packed, and we suggest that this contributes to their low stability. To evaluate the evolutionary selection pressure on interface residues, we calculated the Shannon entropy of homologous amino acid sequences at 60% sequence identity. In 93% of the homodimers, the interface residues are better conserved than the residues on the protein surface. The weak dimers display the same high degree of interface conservation as other homodimers, but their homologs may be heterodimers as well as homodimers. Their interfaces may be good models in terms of their size, composition, and evolutionary conservation for the labile subunit contacts that allow protein assemblies to share and exchange components, allosteric proteins to undergo quaternary structure transitions, and molecular machines to operate in the cell.
Affiliation(s)
- Sucharita Dey
- Bioinformatics Centre, Bose Institute, P-1/12 CIT Scheme VIIM, Calcutta 700 054, India
8
Galperin MY, Cochrane GR. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 2008; 37:D1-4. [PMID: 19033364 PMCID: PMC2686608 DOI: 10.1093/nar/gkn942] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.1] [Indexed: 12/05/2022]
Abstract
The current issue of Nucleic Acids Research includes descriptions of 179 databases, of which 95 are new. These databases (along with several molecular biology databases described in other journals) have been included in the Nucleic Acids Research online Molecular Biology Database Collection, bringing the total number of databases in the collection to 1170. In this introductory comment, we briefly describe some of these new databases and review the principles guiding the selection of databases for inclusion in the Nucleic Acids Research annual Database Issue and the Nucleic Acids Research online Molecular Biology Database Collection. The complete database list and summaries are available online at the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
9
The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. J Proteomics 2008; 72:567-73. [PMID: 19084081 DOI: 10.1016/j.jprot.2008.11.010] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.1] [Received: 09/18/2008] [Revised: 11/04/2008] [Accepted: 11/10/2008] [Indexed: 11/21/2022]
Abstract
The UniProt knowledgebase, UniProtKB, is the main product of the UniProt consortium. It consists of two sections, UniProtKB/Swiss-Prot, the manually curated section, and UniProtKB/TrEMBL, the computer translation of the EMBL/GenBank/DDBJ nucleotide sequence database. Taken together, these two sections cover all the proteins characterized or inferred from all publicly available nucleotide sequences. The Plant Proteome Annotation Program (PPAP) of UniProtKB/Swiss-Prot focuses on the manual annotation of plant-specific proteins and protein families. Our major effort is currently directed towards the two model plants Arabidopsis thaliana and Oryza sativa. In UniProtKB/Swiss-Prot, redundancy is minimized by merging all data from different sources in a single entry. The proposed protein sequence is frequently modified after comparison with ESTs, full length transcripts or homologous proteins from other species. The information present in manually curated entries allows the reconstruction of all described isoforms. The annotation also includes proteomics data such as PTM and protein identification MS experimental results. UniProtKB and the other products of the UniProt consortium are accessible online at www.uniprot.org.
10
Abstract
Protein–protein recognition plays an essential role in structure and function. Specific non-covalent interactions stabilize the structure of macromolecular assemblies, exemplified in this review by oligomeric proteins and the capsids of icosahedral viruses. They also allow proteins to form complexes that have a very wide range of stability and lifetimes and are involved in all cellular processes. We present some of the structure-based computational methods that have been developed to characterize the quaternary structure of oligomeric proteins and other molecular assemblies and analyze the properties of the interfaces between the subunits. We compare the size, the chemical and amino acid compositions and the atomic packing of the subunit interfaces of protein–protein complexes, oligomeric proteins, viral capsids and protein–nucleic acid complexes. These biologically significant interfaces are generally close-packed, whereas the non-specific interfaces between molecules in protein crystals are loosely packed, an observation that gives a structural basis to specific recognition. A distinction is made within each interface between a core that contains buried atoms and a solvent accessible rim. The core and the rim differ in their amino acid composition and their conservation in evolution, and the distinction helps to correlate the structural data with the results of site-directed mutagenesis and in vitro studies of self-assembly.
11
Abstract
To evaluate the evolutionary constraints placed on viral proteins by the structure and assembly of the capsid, we calculate Shannon entropies in the aligned sequences of 45 polypeptide chains in 32 icosahedral viruses, and relate these entropies to the residue location in the three-dimensional structure of the capsids. Three categories of residues have entropies lower than the chain average implying that they are better conserved than average: residues that are buried within a subunit (the protein core), residues that contain atoms buried at an interface between subunits (the interface core), and residues that contribute to several such interfaces. The interface core is also conserved in homomeric proteins and in transient protein-protein complexes, which have only one interface whereas capsids have many. In capsids, the subunit interfaces implicate most of the polypeptide chain: on average, 66% of the capsid residues are at an interface, 34% at more than one, and 47% at the interface core. Nevertheless, we observe that the degree of residue conservation can vary widely between interfaces within a capsid and between regions within an interface. The interfaces and regions of interfaces that show a low sequence variability are likely to play major roles in the self-assembly of the capsid, with implications on its mechanism that we discuss taking adeno-associated virus as an example.
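The per-position Shannon entropy that this abstract relies on can be illustrated with a minimal sketch. The code below is not the authors' pipeline: the function names, the choice of log base 2, the decision to ignore gap characters, and the toy alignment are all assumptions made for illustration. It computes one entropy per alignment column and flags the columns whose entropy falls below the chain average, which is the comparison the abstract describes.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps (an assumption)."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(sequences):
    """Per-column entropies for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment (hypothetical data, not taken from the study).
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]
entropies = alignment_entropies(alignment)
chain_average = sum(entropies) / len(entropies)

# Columns with entropy below the chain average count as better conserved than average.
conserved = [i for i, h in enumerate(entropies) if h < chain_average]
print(conserved)
```

In a real analysis the conserved columns would then be mapped onto residue locations in the three-dimensional structure (protein core, interface core, rim); that mapping is outside the scope of this sketch.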
Affiliation(s)
- Ranjit P Bahadur
- Yeast Structural Genomics, IBBMC Université Paris-Sud, CNRS UMR 8619, 91405-Orsay, France
12
Codoñer FM, Fares MA. Why should we care about molecular coevolution? Evol Bioinform Online 2008; 4:29-38. [PMID: 19204805 PMCID: PMC2614197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
Non-independent evolution of amino acid sites has become a noticeable limitation of most methods aimed at identifying selective constraints at functionally important amino acid sites or protein regions. The need for a generalised framework to account for non-independence of amino acid sites has fuelled the design and development of new mathematical models and computational tools centred on resolving this problem. Molecular coevolution is one of the most active areas of research, with an increasing rate of new models and methods being developed every day. Both parametric and non-parametric methods have been developed to account for correlated variability of amino acid sites. These methods have been utilised for detecting phylogenetic, functional and structural coevolution as well as for identifying surfaces of amino acid sites involved in protein-protein interactions. Here we discuss and briefly describe these methods, and identify their advantages and limitations.
Affiliation(s)
- Francisco M. Codoñer
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
- Institute of Immunology, Biology Department, National University of Ireland Maynooth
- Mario A. Fares
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
13
Abstract
Non-independent evolution of amino acid sites has become a noticeable limitation of most methods aimed at identifying selective constraints at functionally important amino acid sites or protein regions. The need for a generalised framework to account for non-independence of amino acid sites has fuelled the design and development of new mathematical models and computational tools centred on resolving this problem. Molecular coevolution is one of the most active areas of research, with an increasing rate of new models and methods being developed every day. Both parametric and non-parametric methods have been developed to account for correlated variability of amino acid sites. These methods have been utilised for detecting phylogenetic, functional and structural coevolution as well as for identifying surfaces of amino acid sites involved in protein-protein interactions. Here we discuss and briefly describe these methods, and identify their advantages and limitations.
Affiliation(s)
- Francisco M. Codoñer
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
- Institute of Immunology, Biology Department, National University of Ireland Maynooth
- Mario A. Fares
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
14
Dobson RJ, Munroe PB, Caulfield MJ, Saqi MAS. Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics 2006; 7:217. [PMID: 16630345 PMCID: PMC1489951 DOI: 10.1186/1471-2105-7-217] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Received: 12/19/2005] [Accepted: 04/21/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl. RESULTS: The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value. CONCLUSION: The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at http://www.brightstudy.ac.uk/das_help.html.
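For readers unfamiliar with the two ideas this abstract combines, the following is a minimal sketch, not the study's code. It computes the Matthews correlation coefficient from the four cells of a binary confusion matrix and shows one plausible reading of undersampling, namely randomly reducing the majority class to the size of the minority class. That reading of "100% undersampling", the function names, the fixed seed and the toy data are all assumptions.

```python
import math
import random


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention, assumed here).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def undersample(majority, minority, seed=0):
    """Randomly keep as many majority examples as there are minority examples.

    Assumes len(majority) >= len(minority); the seed is only for reproducibility.
    """
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


# Hypothetical example: 100 neutral variants versus 20 disease-associated ones.
neutral = list(range(100))
disease = list(range(100, 120))
kept_neutral, kept_disease = undersample(neutral, disease)
print(len(kept_neutral), len(kept_disease))

# MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect).
print(mcc(50, 50, 0, 0), mcc(25, 25, 25, 25), mcc(0, 0, 50, 50))
```

MCC is often preferred over plain accuracy on imbalanced data because it uses all four confusion-matrix cells, so a classifier that always predicts the majority class scores 0 rather than a misleadingly high value.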
Affiliation(s)
- Richard J Dobson
- Clinical Pharmacology, The William Harvey Research Institute, Bart's and the London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Patricia B Munroe
- Clinical Pharmacology, The William Harvey Research Institute, Bart's and the London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Mark J Caulfield
- Clinical Pharmacology, The William Harvey Research Institute, Bart's and the London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Mansoor AS Saqi
- Bioinformatics, Institute of Cell and Molecular Science, Bart's and the London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
15
Pugalenthi G, Bhaduri A, Sowdhamini R. GenDiS: Genomic Distribution of protein structural domain Superfamilies. Nucleic Acids Res 2005; 33:D252-5. [PMID: 15608190 PMCID: PMC540041 DOI: 10.1093/nar/gki087] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
Abstract
Several proteins that have substantially diverged during evolution retain similar three-dimensional structures and biological function in spite of poor sequence identity. The database on Genomic Distribution of protein structural domain Superfamilies (GenDiS) provides a record of the distribution of 4001 protein domains organized as 1194 structural superfamilies across 18 997 genomes at various levels of hierarchy in taxonomy. The GenDiS database provides a survey of protein domains enlisted in sequence databases employing a 3-fold sequence search approach. Lineage-specific literature is obtained from the taxonomy database for individual protein members to provide a platform for performing genomic and phyletic studies across organisms. The database documents residual properties and provides alignments for the various superfamily members in genomes, offering insights into the rational design of experiments and for the better understanding of a superfamily. The GenDiS database can be accessed at http://www.ncbs.res.in/~faculty/mini/gendis/home.html.
Affiliation(s)
- Ganesan Pugalenthi
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore 560 065, Karnataka, India
16
Worthey EA, Myler PJ. Protozoan genomes: gene identification and annotation. Int J Parasitol 2005; 35:495-512. [PMID: 15826642 DOI: 10.1016/j.ijpara.2005.02.008] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/15/2004] [Revised: 01/25/2005] [Accepted: 02/06/2005] [Indexed: 12/01/2022]
Abstract
The draft sequence of several complete protozoan genomes is now available and genome projects are ongoing for a number of other species. Different strategies are being implemented to identify and annotate protein coding and RNA genes in these genomes, as well as study their genomic architecture. Since the genomes vary greatly in size, GC-content, nucleotide composition, and degree of repetitiveness, genome structure is often a factor in choosing the methodology utilised for annotation. In addition, the approach taken is dictated, to a greater or lesser extent, by the particular reasons for carrying out genome-wide analyses and the level of funding available for projects. Nevertheless, these projects have provided a plethora of material that will aid in understanding the biology and evolution of these parasites, as well as identifying new targets that can be used to design urgently required drug treatments for the diseases they cause.
Affiliation(s)
- E A Worthey
- Seattle Biomedical Research Institute, 307 Westlake Ave N., Seattle, WA 98109-2591, USA
17
Pazos F, Sternberg MJE. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A 2004; 101:14754-9. [PMID: 15456910 PMCID: PMC522026 DOI: 10.1073/pnas.0404569101] [Citation(s) in RCA: 139] [Impact Index Per Article: 7.0] [Received: 06/25/2004] [Indexed: 11/18/2022]
Abstract
Current structural genomics projects are yielding structures for proteins whose functions are unknown. Accordingly, there is a pressing requirement for computational methods for function prediction. Here we present PHUNCTIONER, an automatic method for structure-based function prediction using automatically extracted functional sites (residues associated with functions). The method relates proteins with the same function through structural alignments and extracts 3D profiles of conserved residues. Functional features to train the method are extracted from the Gene Ontology (GO) database. The method extracts these features from the entire GO hierarchy and hence is applicable across the whole range of function specificity. 3D profiles associated with 121 GO annotations were extracted. We tested the power of the method both for the prediction of function and for the extraction of functional sites. The success of function prediction by our method was compared with the standard homology-based method. In the zone of low sequence similarity (approximately 15%), our method assigns the correct GO annotation in 90% of the protein structures considered, approximately 20% higher than inheritance of function from the closest homologue.
Affiliation(s)
- Florencio Pazos
- Structural Bioinformatics Group, Biochemistry Building, Department of Biological Sciences, Imperial College London, London SW7 2AZ, UK
18
Emes RD, Beatson SA, Ponting CP, Goodstadt L. Evolution and comparative genomics of odorant- and pheromone-associated genes in rodents. Genome Res 2004; 14:591-602. [PMID: 15060000 PMCID: PMC383303 DOI: 10.1101/gr.1940604] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.9] [Indexed: 11/24/2022]
Abstract
Chemical cues influence a range of behavioral responses in rodents. The involvement of protein odorants and odorant receptors in mediating reproductive behavior, foraging, and predator avoidance suggests that their genes may have been subject to adaptive evolution. We have estimated the consequences of selection on rodent pheromones, their receptors, and olfactory receptors. These families were chosen on the basis of multiple gene duplications since the common ancestor of rat and mouse. For each family, codons were identified that are likely to have been subject to adaptive evolution. The majority of such sites are situated on the solvent-accessible surfaces of putative pheromones and the lumenal portions of their likely receptors. We predict that these contribute to physicochemical and functional diversity within pheromone-receptor interaction sites.
Affiliation(s)
- Richard D Emes
- MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, UK
19
Oberg KA, Ruysschaert JM, Goormaghtigh E. Rationally selected basis proteins: a new approach to selecting proteins for spectroscopic secondary structure analysis. Protein Sci 2003; 12:2015-31. [PMID: 12931000 PMCID: PMC2323998 DOI: 10.1110/ps.0354703] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.7] [Indexed: 10/27/2022]
Abstract
Protein basis sets have been extensively used as reference data for the determination of protein structure with optical methods such as circular dichroism and infrared spectroscopies. We have taken a new approach to basis protein selection by utilizing three crystal structure classification databases: CATH, SCOP, and PDB_SELECT. Through the use of the information available in these and other online resources, we identified 115 commercially available proteins as potential basis set candidates. By carefully screening the quality of the crystal structures and commercial protein preparations, we obtained a final set of 50 rationally selected proteins (RaSP50) that has been optimized for use in spectroscopic protein structure determination studies. These proteins span the full range of known protein folds as well as alpha-helix and beta-sheet contents, and they represent a more comprehensive variety of fold types than any previous reference set. This report includes a detailed presentation of the reasoning behind the rational protein selection process, a description of the properties of the RaSP50 set, and a discussion of the types of structural and spectral variations that are represented in the set.
Affiliation(s)
- Keith A Oberg
- Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, Free University of Brussels (ULB), B-1050 Brussels, Belgium
|
20
|
del Sol A, del Sol Mesa A, Pazos F, Valencia A. Automatic methods for predicting functionally important residues. J Mol Biol 2003; 326:1289-302. [PMID: 12589769 DOI: 10.1016/s0022-2836(02)01451-1] [Citation(s) in RCA: 169] [Impact Index Per Article: 8.0] [Indexed: 11/18/2022]
Abstract
Sequence analysis is often the first guide for the prediction of residues in a protein family that may have functional significance. A few methods have been proposed which use the division of protein families into subfamilies in the search for those positions that could have some functional significance for the whole family, but at the same time which exhibit the specificity of each subfamily ("Tree-determinant residues"). However, there are still many unsolved questions like the best division of a protein family into subfamilies, or the accurate detection of sequence variation patterns characteristic of different subfamilies. Here we present a systematic study in a significant number of protein families, testing the statistical meaning of the Tree-determinant residues predicted by three different methods that represent the range of available approaches. The first method takes as a starting point a phylogenetic representation of a protein family and, following the principle of Relative Entropy from Information Theory, automatically searches for the optimal division of the family into subfamilies. The second method looks for positions whose mutational behavior is reminiscent of the mutational behavior of the full-length proteins, by directly comparing the corresponding distance matrices. The third method is an automation of the analysis of distribution of sequences and amino acid positions in the corresponding multidimensional spaces using a vector-based principal component analysis. These three methods have been tested on two non-redundant lists of protein families: one composed by proteins that bind a variety of ligand groups, and the other composed by proteins with annotated functionally relevant sites. In most cases, the residues predicted by the three methods show a clear tendency to be close to bound ligands of biological relevance and to those amino acids described as participants in key aspects of protein function. 
These three automatic methods provide a wide range of possibilities for biologists to analyze their families of interest, in a similar way to the one presented here for the family of proteins related with ras-p21.
Affiliation(s)
- Antonio del Sol
- Protein Design Group, National Center for Biotechnology, Cantoblanco, Madrid 28049, Spain
|
21
|
Melo FR, Rigden DJ, Franco OL, Mello LV, Ary MB, Grossi de Sá MF, Bloch C. Inhibition of trypsin by cowpea thionin: characterization, molecular modeling, and docking. Proteins 2002; 48:311-9. [PMID: 12112698 DOI: 10.1002/prot.10142] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.3] [Indexed: 11/05/2022]
Abstract
Higher plants produce several families of proteins with toxic properties, which act as defense compounds against pests and pathogens. The thionin family represents one family and comprises low molecular mass cysteine-rich proteins, usually basic and distributed in different plant tissues. Here, we report the purification and characterization of a new thionin from cowpea (Vigna unguiculata) with proteinase inhibitory activity. Cowpea thionin inhibits trypsin, but not chymotrypsin, binding with a stoichiometry of 1:1 as shown with the use of mass spectrometry. Previous annotations of thionins as proteinase inhibitors were based on their erroneous identification as homologues of Bowman-Birk family inhibitors. Molecular modeling experiments were used to propose a mode of docking of cowpea thionin with trypsin. Consideration of the dynamic properties of the cowpea thionin was essential to arrive at a model with favorable interface characteristics comparable with structures of trypsin-inhibitor complexes determined by X-ray crystallography. In the final model, Lys11 occupies the S1 specificity pocket of trypsin as part of a canonical style interaction.
Affiliation(s)
- Francislete R Melo
- Departamento de Biologia Celular, Universidade de Brasília, Brasília-DF, Brasil.
|
22
|
Pazos F, Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 2002; 47:219-27. [PMID: 11933068 DOI: 10.1002/prot.10074] [Citation(s) in RCA: 183] [Impact Index Per Article: 8.3] [Indexed: 11/07/2022]
Abstract
Deciphering the interaction links between proteins has become one of the main tasks of experimental and bioinformatic methodologies. Reconstruction of complex networks of interactions in simple cellular systems by integrating predicted interaction networks with available experimental data is becoming one of the most demanding needs in the postgenomic era. On the basis of the study of correlated mutations in multiple sequence alignments, we propose a new method (in silico two-hybrid, i2h) that directly addresses the detection of physically interacting protein pairs and identifies the most likely sequence regions involved in the interactions. We have applied the system to several test sets, showing that it can discriminate between true and false interactions in a significant number of cases. We have also analyzed a large collection of E. coli protein pairs as a first step toward the virtual reconstruction of its complete interaction network.
|
23
|
Mallika V, Bhaduri A, Sowdhamini R. PASS2: a semi-automated database of protein alignments organised as structural superfamilies. Nucleic Acids Res 2002; 30:284-8. [PMID: 11752316 PMCID: PMC99156 DOI: 10.1093/nar/30.1.284] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. This database has been created to fall in correspondence with SCOP database (1.53 release) and currently consists of 110 multi-member superfamilies and 613 superfamilies corresponding to single members. In multi-member superfamilies, protein chains with no more than 25% sequence identity have been considered for the alignment and hence the database aims to address sequence alignments which represent 26 219 protein domains under the SCOP 1.53 release. Structure-based sequence alignments have been obtained by COMPARER and the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for the structural features using JOY4.0. Several interesting links are provided to other related databases and genome sequence relatives. Availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity and single-member superfamilies, permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced by PSI-BLAST methods. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.
Affiliation(s)
- V Mallika
- National Centre for Biological Sciences, UAS-GKVK Campus, Bangalore 560 065, India
|
24
|
Fariselli P, Olmea O, Valencia A, Casadio R. Prediction of contact maps with neural networks and correlated mutations. Protein Eng 2001; 14:835-43. [PMID: 11742102 DOI: 10.1093/protein/14.11.835] [Citation(s) in RCA: 149] [Impact Index Per Article: 6.5] [Indexed: 11/12/2022]
Abstract
Contact maps of proteins are predicted with neural network-based methods, using as input codings of increasing complexity including evolutionary information, sequence conservation, correlated mutations and predicted secondary structures. Neural networks are trained on a data set comprising the contact maps of 173 non-homologous proteins as computed from their well resolved three-dimensional structures. Proteins are selected from the Protein Data Bank database provided that they align with at least 15 similar sequences in the corresponding families. The predictors are trained to learn the association rules between the covalent structure of each protein and its contact map with a standard back propagation algorithm and tested on the same protein set with a cross-validation procedure. Our results indicate that the method can assign protein contacts with an average accuracy of 0.21 and with an improvement over a random predictor of a factor >6, which is higher than that previously obtained with methods only based either on neural networks or on correlated mutations. Furthermore, filtering the network outputs with a procedure based on the residue coordination numbers, the accuracy of predictions increases up to 0.25 for all the proteins, with an 8-fold deviation from a random predictor. These scores are the highest reported so far for predicting protein contact maps.
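The two figures quoted in this abstract (accuracy and improvement over a random predictor) can be illustrated with a minimal sketch. It assumes that accuracy means the fraction of predicted contacts that are real, and that the random baseline is the fraction of candidate residue pairs that are real contacts; the abstract does not spell out either definition, and the function names and the `min_separation` parameter are hypothetical.

```python
from itertools import combinations


def contact_accuracy(predicted, observed):
    """Fraction of predicted contacts that are also observed contacts."""
    predicted, observed = set(predicted), set(observed)
    if not predicted:
        return 0.0
    return len(predicted & observed) / len(predicted)


def improvement_over_random(predicted, observed, n_residues, min_separation=1):
    """Accuracy divided by the accuracy expected from picking pairs at random.

    Assumption: the random baseline is the share of candidate pairs (i < j,
    at least min_separation apart) that are observed contacts.
    """
    candidate_pairs = {
        (i, j)
        for i, j in combinations(range(n_residues), 2)
        if j - i >= min_separation
    }
    random_accuracy = len(set(observed) & candidate_pairs) / len(candidate_pairs)
    if random_accuracy == 0:
        raise ValueError("no observed contacts among the candidate pairs")
    return contact_accuracy(predicted, observed) / random_accuracy


if __name__ == "__main__":
    observed = {(0, 2), (1, 4)}      # toy "true" contacts in a 5-residue chain
    predicted = [(0, 2), (0, 3)]     # toy predictions, one of them correct
    print(contact_accuracy(predicted, observed))
    print(improvement_over_random(predicted, observed, n_residues=5))
```

Residue pairs are written as `(i, j)` with `i < j`, so the same ordering has to be used for predicted and observed contacts.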
Affiliation(s)
- P Fariselli
- CIRB and Department of Biology, University of Bologna, via Irnerio 42, Bologna, Italy
|
25
|
Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 2001; 14:609-14. [PMID: 11707606 DOI: 10.1093/protein/14.9.609] [Citation(s) in RCA: 303] [Impact Index Per Article: 13.2] [Indexed: 02/06/2023]
Abstract
Deciphering the network of protein interactions that underlines cellular operations has become one of the main tasks of proteomics and computational biology. Recently, a set of bioinformatics approaches has emerged for the prediction of possible interactions by combining sequence and genomic information. Even though the initial results are very promising, the current methods are still far from perfect. We propose here a new way of discovering possible protein-protein interactions based on the comparison of the evolutionary distances between the sequences of the associated protein families, an idea based on previous observations of correspondence between the phylogenetic trees of associated proteins in systems such as ligands and receptors. Here, we extend the approach to different test sets, including the statistical evaluation of their capacity to predict protein interactions. To demonstrate the possibilities of the system to perform large-scale predictions of interactions, we present the application to a collection of more than 67 000 pairs of E.coli proteins, of which 2742 are predicted to correspond to interacting proteins.
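The comparison of evolutionary distances described in this abstract can be sketched as a correlation between two distance matrices. This is a minimal illustration, assuming a Pearson correlation over the upper triangles of the matrices and assuming that row `k` of both matrices refers to the same species; the abstract does not state the exact statistic, and the function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def tree_similarity(dist_a, dist_b):
    """Correlate the pairwise distances of two protein families.

    Assumption: both matrices are square, symmetric, and ordered by the
    same list of species.
    """
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


if __name__ == "__main__":
    family_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    family_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]  # same tree shape, scaled
    print(tree_similarity(family_a, family_b))
```

A value close to 1 would suggest that the two families have evolved in step; where to set a cut-off for calling a pair interacting is left open here.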
Affiliation(s)
- F Pazos
- Protein Design Group, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
|
26
|
Bonneau R, Strauss CE, Baker D. Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 2001; 43:1-11. [PMID: 11170209 DOI: 10.1002/1097-0134(20010401)43:1<1::aid-prot1012>3.0.co;2-a] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.9] [Indexed: 11/06/2022]
Abstract
This study explores the use of multiple sequence alignment (MSA) information and global measures of hydrophobic core formation for improving the Rosetta ab initio protein structure prediction method. The most effective use of the MSA information is achieved by carrying out independent folding simulations for a subset of the homologous sequences in the MSA and then identifying the free energy minima common to all folded sequences via simultaneous clustering of the independent folding runs. Global measures of hydrophobic core formation, using ellipsoidal rather than spherical representations of the hydrophobic core, are found to be useful in removing non-native conformations before cluster analysis. Through this combination of MSA information and global measures of protein core formation, we significantly increase the performance of Rosetta on a challenging test set. Proteins 2001;43:1-11.
Affiliation(s)
- R Bonneau
- Department of Biochemistry, Box 357350, University of Washington, Seattle, Washington, USA
|
27
|
Elcock AH, McCammon JA. Identification of protein oligomerization states by analysis of interface conservation. Proc Natl Acad Sci U S A 2001; 98:2990-4. [PMID: 11248019 PMCID: PMC30594 DOI: 10.1073/pnas.061411798] [Citation(s) in RCA: 100] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022] Open
Abstract
The discrimination of true oligomeric protein-protein contacts from nonspecific crystal contacts remains problematic. Criteria that have been used previously base the assignment of oligomeric state on consideration of the area of the interface and/or the results of scoring functions based on statistical potentials. Both techniques have a high success rate but fail in more than 10% of cases. More importantly, the oligomeric states of several proteins are incorrectly assigned by both methods. Here we test the hypothesis that true oligomeric contacts should be identifiable on the basis of an increased degree of conservation of the residues involved in the interface. By quantifying the degree of conservation of the interface and comparing it with that of the remainder of the protein surface, we develop a new criterion that provides a highly effective complement to existing methods.
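The idea of comparing interface conservation with that of the rest of the surface can be sketched as follows. This is only an illustration: it assumes Shannon entropy of alignment columns as the conservation measure and a simple ratio of mean entropies as the criterion, neither of which is specified in the abstract, and all names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    total = len(column)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )


def interface_conservation_ratio(alignment, interface_positions, surface_positions):
    """Mean entropy of interface columns over mean entropy of other surface columns.

    Assumption: values below 1 mean the interface varies less (is more
    conserved) than the remainder of the surface.
    """
    interface = set(interface_positions)
    rest = set(surface_positions) - interface
    if not interface or not rest:
        raise ValueError("need interface and non-interface surface positions")

    def mean_entropy(positions):
        columns = ([seq[p] for seq in alignment] for p in positions)
        return sum(column_entropy(col) for col in columns) / len(positions)

    rest_entropy = mean_entropy(rest)
    if rest_entropy == 0:
        raise ValueError("non-interface surface is fully conserved")
    return mean_entropy(interface) / rest_entropy


if __name__ == "__main__":
    # Toy alignment: column 0 is the "interface", column 1 the other surface.
    alignment = ["AA", "AC", "CD", "CE"]
    print(interface_conservation_ratio(alignment, [0], [0, 1]))
```

In practice the interface and surface positions would come from the crystal structure; here they are passed in as plain index lists.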
Affiliation(s)
- A H Elcock
- Department of Biochemistry, University of Iowa, Iowa City, IA 52242-1109, USA.
|
28
|
Mandel-Gutfreund Y, Zaremba SM, Gregoret LM. Contributions of residue pairing to beta-sheet formation: conservation and covariation of amino acid residue pairs on antiparallel beta-strands. J Mol Biol 2001; 305:1145-59. [PMID: 11162120 DOI: 10.1006/jmbi.2000.4364] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
In an effort to better understand beta-sheet assembly, we have investigated the evolutionary behavior of neighboring residues on adjacent antiparallel beta-strands. Residue pairs were classified according to solvent exposure as well as by whether their backbone NH and C=O groups are hydrogen bonded. The conservation and covariation of 19,241 pairs in 219 sequence alignments was analyzed. Buried pairs were found to be the most conserved, while stronger covariation was detected in the solvent-exposed pairs. However, residues on neighboring strands showed a degree of conservation and covariation similar to that of well-separated residues on the same strand, suggesting that evolutionary pressure to maintain complementarity between pairs on neighboring strands is weak. Moreover, in spite of the preference of certain amino acid pairs to occupy neighboring positions on adjacent strands, such favored pairs are neither more strongly mutually conserved nor covary more strongly than pairs of the same type in non-interacting positions. Although the beta-sheet pairs did not show outstanding evolutionary coupling, in many protein families significant conservation and covariation patterns were detected for some of the residue pairs. Overall, the weak evolutionary conservation and covariation of the beta-sheet pairs indicates that sheet structure is unlikely to be dictated by specific side-chain interactions.
Affiliation(s)
- Y Mandel-Gutfreund
- Department of Chemistry and Biochemistry, University of California, Santa Cruz 95064, USA
|
29
|
Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000; 9:232-41. [PMID: 10716175 PMCID: PMC2144550 DOI: 10.1110/ps.9.2.232] [Citation(s) in RCA: 385] [Impact Index Per Article: 16.0] [Indexed: 10/19/2022]
Abstract
Distant homologies between proteins are often discovered only after three-dimensional structures of both proteins are solved. The sequence divergence for such proteins can be so large that simple comparison of their sequences fails to identify any similarity. New generation of sensitive alignment tools use averaged sequences of entire homologous families (profiles) to detect such homologies. Several algorithms, including the newest generation of BLAST algorithms and BASIC, an algorithm used in our group to assign fold predictions for proteins from several genomes, are compared to each other on the large set of structurally similar proteins with little sequence similarity. Proteins in the benchmark are classified according to the level of their similarity, which allows us to demonstrate that most of the improvement of the new algorithms is achieved for proteins with strong functional similarities, with almost no progress in recognizing distant fold similarities. It is also shown that details of profile calculation strongly influence its sensitivity in recognizing distant homologies. The most important choice is how to include information from diverging members of the family, avoiding generating false predictions, while accounting for entire sequence divergence within a family. PSI-BLAST takes a conservative approach, deriving a profile from core members of the family, providing a solid improvement without almost any false predictions. BASIC strives for better sensitivity by increasing the weight of divergent family members and paying the price in lower reliability. A new FFAS algorithm introduced here uses a new procedure for profile generation that takes into account all the relations within the family and matches BASIC sensitivity with PSI-BLAST like reliability.
Affiliation(s)
- L Rychlewski
- San Diego Supercomputer Center, La Jolla, California 92093, USA
|
30
|
Olmea O, Rost B, Valencia A. Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 1999; 293:1221-39. [PMID: 10547297 DOI: 10.1006/jmbi.1999.3208] [Citation(s) in RCA: 125] [Impact Index Per Article: 5.0] [Indexed: 11/22/2022]
Abstract
Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.
Affiliation(s)
- O Olmea
- Protein Design Group, CNB-CSIC, Cantoblanco, Madrid, E-28049, Spain
|
31
|
Lebeda FJ, Olson MA. Prediction of a conserved, neutralizing epitope in ribosome-inactivating proteins. Int J Biol Macromol 1999; 24:19-26. [PMID: 10077268 DOI: 10.1016/s0141-8130(98)00059-2] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
Abstract
The secondary structures, side-chain solvent accessibilities, and superpositioned crystal structures of the A-chain of ricin and four other plant rRNA N-glycosidases (ribosome-inactivating proteins, RIPs) were examined. Previously, a 26-residue fragment from the A-chain of ricin was determined to bind to a neutralizing monoclonal antibody. The region in the native ricin A-chain, to which this peptide corresponds, is solvent-exposed and contains a negatively charged residue that has been hypothesized to participate in the toxin's function, namely, rRNA binding and/or enzymatic activity. This region appears to be conserved in all of the structurally defined plant RIPs examined. Moreover, other plant RIPs, whose tertiary structures are, as yet, unknown, were predicted to have an analogous, solvent-exposed region containing a conserved, negatively charged residue. By analogy, these conserved structural and functional features lead to the suggestion that this exposed region represents a logical starting point for experiments designed to locate neutralizing epitopes in these RIPs. In contrast, the tertiary structure of the analogous region in a bacteria-derived RIP (Shiga toxin) is a less solvent-exposed, truncated loop and is a structure that is not as likely to be a neutralizing epitope. Because most of the amino acid residues are not conserved within this exposed region, these RIPs are predicted to be antigenically distinct.
Affiliation(s)
- F J Lebeda
- Department of Cell Biology and Biochemistry, US Army Medical Research Institute of Infectious Diseases, Frederick, MD 21702-5011, USA.
|
32
|
Tsugita A, Kamo M, Miyazaki K, Takayama M, Kawakami T, Shen R, Nozawa T. Additional possible tools for identification of proteins on one- or two-dimensional electrophoresis. Electrophoresis 1998; 19:928-38. [PMID: 9638939 DOI: 10.1002/elps.1150190608] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
Abstract
Additional, essentially chemical, identification methods of proteins in polyacrylamide gel electrophoresis are described. Two cleavages of peptide bonds were used at the C-side of aspartic acid with a 0.2% pentafluoropropionic acid (PFPA) aqueous vapor at 90 degrees C for 4-16 h, and the N-side of serine/threonine with an S-ethyl trifluorothioacetate vapor at 50 degrees C for 6-24 h. The products were analyzed by mass spectrometry-peptide mass fingerprinting. A new type of C-terminal sequencing at multisites of protein was introduced. An aqueous vapor of 90% PFPA at 90 degrees C for 2-16 h provided cleavages at the C-side of aspartic acid and the N-side of serine/threonine and simultaneous successive truncation at the C-termini of the cleaved fragments. The product resulted in C-terminal sequences at multisites in proteins by mass spectrometric analysis. The following chemical deblocking methods were used. Anhydrous hydrazine vapor at -5 degrees C for 8 h deblocked the N-formyl group, and the vapor at 20 degrees C for 4 h deblocked pyrrolidone carboxylate. N-acetylserine/threonine was deblocked by aqueous vapor of 75% PFPA at 50 degrees C for 1 h, followed by reaction with p-sulfophenylisothiocyanate at pH 6.0. These methods were applied to a variety of protein spots on polyacrylamide gels. A new stepwise C-terminal sequencing of protein from polyacrylamide gels is also described.
Affiliation(s)
- A Tsugita
- Research Institute for Biosciences, Science University of Tokyo, Yamazaki, Noda, Japan.
|
33
|
Pazos F, Helmer-Citterich M, Ausiello G, Valencia A. Correlated mutations contain information about protein-protein interaction. J Mol Biol 1997; 271:511-23. [PMID: 9281423 DOI: 10.1006/jmbi.1997.1198] [Citation(s) in RCA: 345] [Impact Index Per Article: 12.8] [Indexed: 02/05/2023]
Abstract
Many proteins have evolved to form specific molecular complexes and the specificity of this interaction is essential for their function. The network of the necessary inter-residue contacts must consequently constrain the protein sequences to some extent. In other words, the sequence of an interacting protein must reflect the consequence of this process of adaptation. It is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting proteins must be compensated by changes in the other. Here we apply a method for detecting correlated changes in multiple sequence alignments to a set of interacting protein domains and show that positions where changes occur in a correlated fashion in the two interacting molecules tend to be close to the protein-protein interfaces. This leads to the possibility of developing a method for predicting contacting pairs of residues from the sequence alone. Such a method would not need the knowledge of the structure of the interacting proteins, and hence would be both radically different and more widely applicable than traditional docking methods. We indeed demonstrate here that the information about correlated sequence changes is sufficient to single out the right inter-domain docking solution amongst many wrong alternatives of two-domain proteins. The same approach is also used here in one case (haemoglobin) where we attempt to predict the interface of two different proteins rather than two protein domains. Finally, we report here a prediction about the inter-domain contact regions of the heat-shock protein Hsc70 based only on sequence information.
Affiliation(s)
- F Pazos
- Protein Design Group CNB-CSIC, Campus U. Autónoma, Madrid, Cantoblanco, 28049, Spain
|
34
|
Olmea O, Valencia A. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 1997; 2:S25-32. [PMID: 9218963 DOI: 10.1016/s1359-0278(97)00060-6] [Citation(s) in RCA: 157] [Impact Index Per Article: 5.8] [Indexed: 02/04/2023]
Abstract
We have previously developed a method for predicting interresidue contacts using information about correlated mutations in multiple sequence alignments. The predictions generated with this method were clearly better than random but not enough for their use in de novo protein folding experiments. We assess the possibility of improving contact predictions combining information from the following variables: correlated mutations, sequence conservation, sequence separation along the chain, alignment stability, family size, residue-specific contact occupancy and formation of contact networks. The application of a protocol for combining these independent variables leads to contact predictions that are on average two times better than those obtained initially with correlated mutations. Correlated mutations can be effectively combined with other types of information derived from multiple sequence alignments. Among the different variables tried, sequence conservation and contact density are particularly relevant for the combination with correlated mutations.
Affiliation(s)
- O Olmea
- Protein Design Group, CNB-CSIC, Campus U Autonoma, Cantoblanco, Madrid, Spain
|
35
|
Pearson BM, Hernando Y, Payne J, Wolf SS, Kalogeropoulos A, Schweizer M. Sequencing of a 35·71 kb DNA segment on the right arm of yeast chromosome XV reveals regions of similarity to chromosomes I and XIII. Yeast 1996. [DOI: 10.1002/(sici)1097-0061(199609)12:10b<1021::aid-yea981>3.0.co;2-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
|
36
|
Pearson BM, Hernando Y, Payne J, Wolf SS, Kalogeropoulos A, Schweizer M. Sequencing of a 35.71 kb DNA segment on the right arm of yeast chromosome XV reveals regions of similarity to chromosomes I and XIII. Yeast 1996; 12:1021-31. [PMID: 8896266 DOI: 10.1002/(sici)1097-0061(199609)12:10b<1021::aid-yea981>3.0.co;2-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 02/02/2023] Open
Abstract
In a shotgun approach we sequenced the cosmid pEOA284 containing a fragment derived from the right arm of chromosome XV of Saccharomyces cerevisiae. An analysis of the sequence revealed that it contained open reading frames (ORFs) corresponding to the known genes SLY41, SPS4, COT1, FAA1, PMT3, PRO2 and MYO2. Of the 18 unknown ORFs, five are contained totally within, and two, O6105 and O6163, partially overlap other ORFs. ORF O6116 and O6139 have putative introns. Regions of similarity with chromosomes I and XIII have been uncovered. Interestingly, most of the paired ORFs encode proteins of the same gene family. The relatedness of these ORFs suggests gene duplication.
Affiliation(s)
- B M Pearson
- Genetics & Microbiology Department, Norwich Research Park, Colney, U.K
|
37
|
Casari G, Sander C, Valencia A. A method to predict functional residues in proteins. Nat Struct Biol 1995; 2:171-8. [PMID: 7749921 DOI: 10.1038/nsb0295-171] [Citation(s) in RCA: 294] [Impact Index Per Article: 10.1] [Indexed: 01/26/2023]
Abstract
The biological activity of a protein typically depends on the presence of a small number of functional residues. Identifying these residues from the amino acid sequences alone would be useful. Classically, strictly conserved residues are predicted to be functional but often conservation patterns are more complicated. Here, we present a novel method that exploits such patterns for the prediction of functional residues. The method uses a simple but powerful representation of entire proteins, as well as sequence residues as vectors in a generalised 'sequence space'. Projection of these vectors onto a lower-dimensional space reveals groups of residues specific for particular subfamilies that are predicted to be directly involved in protein function. Based on the method we present testable predictions for sets of functional residues in SH2 domains and in the conserved box of cyclins.
|
38
|
Voss H, Tamames J, Teodoru C, Valencia A, Sensen C, Wiemann S, Schwager C, Zimmermann J, Sander C, Ansorge W. Nucleotide sequence and analysis of the centromeric region of yeast chromosome IX. Yeast 1995; 11:61-78. [PMID: 7762303 DOI: 10.1002/yea.320110109] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
We have determined the nucleotide sequence of a cosmid (pIX338) containing the centromere region of yeast (Saccharomyces cerevisiae) chromosome IX. The complete nucleotide sequence of 33.8 kb was obtained by using an efficient directed sequencing strategy in combination with automated DNA sequencing on the A.L.F. DNA sequencer. Sequence analysis revealed the presence of 17 open reading frames (ORFs), four of them previously known yeast genes (sly12, pan1, sts1 and prl1), a tRNA gene and the centromere motif. Exhaustive database searches detected sequence homologues of known function for as many as 14 of the 17 ORFs. These include a mammalian tyrosine kinase substrate; the Escherichia coli cell cycle protein MinD; the human inositol polyphosphate-5-phosphatase (gene OCRL) involved in Lowe's syndrome, a developmental disorder; and helicases, for which the new yeast member defines a distinct DEAD/H-box subfamily. A surprisingly large fraction of the ORFs (at least six out of 17) in the centromeric region are apparently involved in RNA or DNA binding.
Affiliation(s)
- H Voss
- Biological Structures and Biocomputing Programmes, European Molecular Biology Laboratory, Heidelberg, Germany
|
39
|
Abstract
Currently, the prediction of three-dimensional (3D) protein structure from sequence alone is an exceedingly difficult task. As an intermediate step, a much simpler task has been pursued extensively: predicting 1D strings of secondary structure. Here, we present an analysis of another 1D projection from 3D structure: the relative solvent accessibility of each residue. We show that solvent accessibility is less conserved in 3D homologues than is secondary structure, and hence is predicted less accurately from automatic homology modeling; the correlation coefficient of relative solvent accessibility between 3D homologues is only 0.77, and the average accuracy of predictions based on sequence alignments is only 0.68. The latter number provides an effective upper limit on the accuracy of predicting accessibility from sequence when homology modeling is not possible. We introduce a neural network system that predicts relative solvent accessibility (projected onto ten discrete states) using evolutionary profiles of amino acid substitutions derived from multiple sequence alignments. Evaluated in a cross-validation test on 238 unique proteins, the correlation between predicted and observed relative accessibility is 0.54. Interpreted in terms of a three-state (buried, intermediate, exposed) description of relative accessibility, the fraction of correctly predicted residue states is about 58%. In absolute terms this accuracy appears poor, but given the relatively low conservation of accessibility in 3D families, the network system is not far from its likely optimal performance. The most reliably predicted fraction of the residues (50%) is predicted as accurately as by automatic homology modeling. Prediction is best for buried residues, e.g., 86% of the completely buried sites are correctly predicted as having 0% relative accessibility.
Affiliation(s)
- B Rost
- Protein Design Group, European Molecular Biology Laboratory, Heidelberg, Germany
|
40
|
Emmert DB, Stoehr PJ, Stoesser G, Cameron GN. The European Bioinformatics Institute (EBI) databases. Nucleic Acids Res 1994; 22:3445-9. [PMID: 7937043 PMCID: PMC308299 DOI: 10.1093/nar/22.17.3445] [Citation(s) in RCA: 78] [Impact Index Per Article: 2.6] [Indexed: 01/27/2023]
Abstract
This paper describes the databases and services of the European Bioinformatics Institute (EBI). In collaboration with DDBJ and GenBank/NCBI, the EBI maintains and distributes the EMBL Nucleotide Sequence Database, Europe's primary nucleotide sequence data resource. The EBI also maintains and distributes the SWISS-PROT Protein Sequence Database, in collaboration with Amos Bairoch of the University of Geneva. Over thirty additional specialist molecular biology databases, as well as software and documentation of interest to molecular biologists, are also available. The EBI network services include database searching, entry retrieval, and sequence similarity searching facilities.
Affiliation(s)
- D B Emmert
- European Bioinformatics Institute, Cambridge, UK
|
41
|
Abstract
Although the 'structure from sequence' prediction problem remains fundamentally unsolved, new and promising methods in one, two and three dimensions have reopened the field. Significantly improved one-dimensional prediction of secondary structure from multiple sequence alignments is now in routine use. In the two-dimensional approach, inter-residue contacts can be detected by analysis of correlated mutations, albeit with low accuracy. Finally, three-dimensional methods, in which pseudopotentials or information values are derived from the databases, are proving their value for distinguishing between correct and incorrect models.
Affiliation(s)
- B Rost
- European Molecular Biology Laboratory, Heidelberg, Germany
|
42
|
|
43
|
Rost B, Sander C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994; 19:55-72. [PMID: 8066087 DOI: 10.1002/prot.340190108] [Citation(s) in RCA: 1157] [Impact Index Per Article: 38.6] [Indexed: 01/28/2023]
Abstract
Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments.
Affiliation(s)
- B Rost
- European Molecular Biology Laboratory, Heidelberg, Germany
|
44
|
Abstract
Through the comprehensive analysis of protein sequence and structural data, relationships can be established that suggest, with varying degrees of success, structural models for a protein for which only the sequence is known. The certainty with which a model can be proposed depends on the degree of similarity between the sequence of unknown structure and the sequence of a protein of known structure. Methods are being developed to detect remote similarities between sequences or structures, and to predict protein structure based on such small levels of similarity.
Affiliation(s)
- W R Taylor
- Laboratory of Mathematical Biology, National Institute for Medical Research, London, UK
|