1
Dokholyan NV. Experimentally-driven protein structure modeling. J Proteomics 2020; 220:103777. [PMID: 32268219 PMCID: PMC7214187 DOI: 10.1016/j.jprot.2020.103777] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/18/2019] [Revised: 03/17/2020] [Accepted: 04/02/2020] [Indexed: 11/25/2022]
Abstract
Revolutions in natural and exact sciences started at the dawn of last century have led to the explosion of theoretical, experimental, and computational approaches to determine structures of molecules, complexes, as well as their rich conformational dynamics. Since different experimental methods produce information that is attributed to specific time and length scales, corresponding computational methods have to be tailored to these scales and experiments. These methods can be then combined and integrated in scales, hence producing a fuller picture of molecular structure and motion from the "puzzle pieces" offered by various experiments. Here, we describe a number of computational approaches to utilize experimental data to glance into structure of proteins and understand their dynamics. We will also discuss the limitations and the resolution of the constraints-based modeling approaches. SIGNIFICANCE: Experimentally-driven computational structure modeling and determination is a rapidly evolving alternative to traditional approaches for molecular structure determination. These new hybrid experimental-computational approaches are proving to be a powerful microscope to glance into the structural features of intrinsically or partially disordered proteins, dynamics of molecules and complexes. In this review, we describe various approaches in the field of experimentally-driven computational structure modeling.
Affiliation(s)
- Nikolay V Dokholyan
- Department of Pharmacology, Penn State University College of Medicine, Hershey, PA 17033, USA; Department of Biochemistry & Molecular Biology, Penn State College of Medicine, Hershey, PA 17033, USA.; Department of Chemistry, Pennsylvania State University, University Park, PA 16802, USA.; Department of Biomedical Engineering, Pennsylvania State University, University Park, PA 16802, USA.
2
Nikolaev D, Shtyrov AA, Panov MS, Jamal A, Chakchir OB, Kochemirovsky VA, Olivucci M, Ryazantsev MN. A Comparative Study of Modern Homology Modeling Algorithms for Rhodopsin Structure Prediction. ACS OMEGA 2018; 3:7555-7566. [PMID: 30087916 PMCID: PMC6068592 DOI: 10.1021/acsomega.8b00721] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 04/13/2018] [Accepted: 06/21/2018] [Indexed: 06/08/2023]
Abstract
Rhodopsins are seven α-helical membrane proteins that are of great importance in chemistry, biology, and modern biotechnology. Any in silico study on rhodopsin properties and functioning requires a high-quality three-dimensional structure. Due to particular difficulties with obtaining membrane protein structures from the experiment, in silico prediction of the three-dimensional rhodopsin structure based only on its primary sequence is an especially important task. For the last few years, significant progress was made in the field of protein structure prediction, especially for methods based on comparative modeling. However, the majority of this progress was made for soluble proteins and further investigations are needed to achieve similar progress for membrane proteins. In this paper, we evaluate the performance of modern protein structure prediction methodologies (implemented in the Medeller, I-TASSER, and Rosetta packages) for their ability to predict rhodopsin structures. Three widely used methodologies were considered: two general methodologies that are commonly applied to soluble proteins and a methodology that uses constraints that are specific for membrane proteins. The test pool consisted of 36 target-template pairs with different sequence similarities that was constructed on the basis of 24 experimental rhodopsin structures taken from the RCSB database. As a result, we showed that all three considered methodologies allow obtaining rhodopsin structures with the quality that is close to the crystallographic one (root mean square deviation (RMSD) of the predicted structure from the corresponding X-ray structure up to 1.5 Å) if the target-template sequence identity is higher than 40%. Moreover, all considered methodologies provided structures of average quality (RMSD < 4.0 Å) if the target-template sequence identity is higher than 20%. Such structures can be subsequently used for further investigation of molecular mechanisms of protein functioning and for the development of modern protein-based biotechnologies.
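The RMSD figures quoted in this abstract (up to 1.5 Å, below 4.0 Å) compare a predicted structure with an X-ray structure. A minimal sketch of the quantity itself follows, assuming the two coordinate sets are already superimposed and list matching atoms in the same order; the function name `rmsd` and the sample coordinates are illustrative and do not come from the paper.

```python
from math import dist, sqrt


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched lists of (x, y, z) points.

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b (both are assumptions of
    this sketch, not details taken from the abstract).
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Mean of the squared per-atom distances, then the square root.
    mean_sq = sum(dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)) / len(coords_a)
    return sqrt(mean_sq)


# Illustrative coordinates in Å: every atom is displaced by exactly 1 Å along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
deviation = rmsd(predicted, reference)
```

In practice an optimal superposition step precedes this calculation; it is omitted here to keep the sketch short.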
Affiliation(s)
- Dmitrii M. Nikolaev
- Nanotechnology Research and Education Centre RAS, Saint-Petersburg Academic University, 8/3 Khlopina Street, St. Petersburg 194021, Russia
- Andrey A. Shtyrov
- Nanotechnology Research and Education Centre RAS, Saint-Petersburg Academic University, 8/3 Khlopina Street, St. Petersburg 194021, Russia
- Maxim S. Panov
- Institute of Chemistry, Saint Petersburg State University, 7/9 Universitetskaya emb., St. Petersburg 199034, Russia
- Adeel Jamal
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Oleg B. Chakchir
- Nanotechnology Research and Education Centre RAS, Saint-Petersburg Academic University, 8/3 Khlopina Street, St. Petersburg 194021, Russia
- Vladimir A. Kochemirovsky
- Institute of Chemistry, Saint Petersburg State University, 7/9 Universitetskaya emb., St. Petersburg 199034, Russia
- Massimo Olivucci
- Department of Biotechnology, Chemistry and Pharmacy, Università di Siena, via A. Moro 2, Siena I-53100, Italy
- Mikhail N. Ryazantsev
- Institute of Chemistry, Saint Petersburg State University, 7/9 Universitetskaya emb., St. Petersburg 199034, Russia
- Institute of Macromolecular Compounds of the Russian Academy of Sciences, 31 Bolshoy pr., St. Petersburg 199004, Russia
3
Gupta SK, Chaudhary KK, Mishra N. Bioinformatics and Its Therapeutic Applications. PHARMACEUTICAL SCIENCES 2017. [DOI: 10.4018/978-1-5225-1762-7.ch016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
Bioinformatics has emerged as a major element of the contemporary biomedical and pharmaceutical fields. It deals with the growth in biological data and has led to the development of many databases. It also deals with the collection of clinically relevant data, for which a separate term, clinical information, has come into use. Data mimics are another field which is gaining importance. This chapter introduces bioinformatics and its applications in medicine and health care.
4
Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014; 53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]
5
Chen YC. The Molecular Dynamic Simulation of Zolpidem Interaction with Gamma Aminobutyric Acid Type A Receptor. J CHIN CHEM SOC-TAIP 2013. [DOI: 10.1002/jccs.200700093] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Indexed: 11/11/2022]
6
Johansson MU, Zoete V, Michielin O, Guex N. Defining and searching for structural motifs using DeepView/Swiss-PdbViewer. BMC Bioinformatics 2012; 13:173. [PMID: 22823337 PMCID: PMC3436773 DOI: 10.1186/1471-2105-13-173] [Citation(s) in RCA: 204] [Impact Index Per Article: 17.0] [Received: 09/15/2011] [Accepted: 07/06/2012] [Indexed: 11/10/2022] Open
Abstract
Background Today, recognition and classification of sequence motifs and protein folds is a mature field, thanks to the availability of numerous comprehensive and easy to use software packages and web-based services. Recognition of structural motifs, by comparison, is less well developed and much less frequently used, possibly due to a lack of easily accessible and easy to use software. Results In this paper, we describe an extension of DeepView/Swiss-PdbViewer through which structural motifs may be defined and searched for in large protein structure databases, and we show that common structural motifs involved in stabilizing protein folds are present in evolutionarily and structurally unrelated proteins, also in deeply buried locations which are not obviously related to protein function. Conclusions The possibility to define custom motifs and search for their occurrence in other proteins permits the identification of recurrent arrangements of residues that could have structural implications. The possibility to do so without having to maintain a complex software/hardware installation on site brings this technology to experts and non-experts alike.
Affiliation(s)
- Maria U Johansson
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
7
Karakaş M, Woetzel N, Meiler J. BCL::contact-low confidence fold recognition hits boost protein contact prediction and de novo structure determination. J Comput Biol 2010; 17:153-68. [PMID: 19772383 DOI: 10.1089/cmb.2009.0030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
Knowledge of all residue-residue contacts within a protein allows determination of the protein fold. Accurate prediction of even a subset of long-range contacts (contacts between amino acids far apart in sequence) can be instrumental for determining tertiary structure. Here we present BCL::Contact, a novel contact prediction method that utilizes artificial neural networks (ANNs) and specializes in the prediction of medium to long-range contacts. BCL::Contact comes in two modes: sequence-based and structure-based. The sequence-based mode uses only sequence information and has individual ANNs specialized for helix-helix, helix-strand, strand-helix, strand-strand, and sheet-sheet contacts. The structure-based mode combines results from 32 fold-recognition methods with sequence information to a consensus prediction. The two methods were presented in the 6th and 7th Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments. The present work focuses on elucidating the impact of fold recognition results onto contact prediction via a direct comparison of both methods on a joined benchmark set of proteins. The sequence-based mode predicted contacts with 42% accuracy (7% false positive rate), while the structure-based mode achieved 45% accuracy (2% false positive rate). Predictions by both modes of BCL::Contact were supplied as input to the protein tertiary structure prediction program Rosetta for a benchmark of 17 proteins with no close sequence homologs in the protein data bank (PDB). Rosetta created higher accuracy models, signified by an improvement of 1.3 Å on average root mean square deviation (RMSD), when driven by the predicted contacts. Further, filtering Rosetta models by agreement with the predicted contacts enriches for native-like fold topologies.
Affiliation(s)
- Mert Karakaş
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, USA
8
Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci 2009; 7:27. [PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 03/13/2009] [Accepted: 08/09/2009] [Indexed: 02/07/2023] Open
Abstract
Background Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities. Results A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function. Conclusion We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
9
Faure G, Bornot A, de Brevern AG. Analysis of protein contacts into Protein Units. Biochimie 2009; 91:876-87. [PMID: 19383526 DOI: 10.1016/j.biochi.2009.04.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 07/24/2008] [Accepted: 04/13/2009] [Indexed: 11/18/2022]
Abstract
Three-dimensional structures of proteins underpin their biological functions. Their folds are maintained by inter-residue interactions, which are one of the main focuses in understanding the mechanisms of protein folding and stability. Furthermore, protein structures can be composed of single or multiple functional domains that can fold and function independently. Hence, dividing a protein into domains is useful for obtaining an accurate structure and function determination. In previous studies, we elucidated protein contact properties according to different definitions and developed a novel methodology named Protein Peeling. Within protein structures, Protein Peeling characterizes small successive compact units along the sequence called protein units (PUs). The cutting done by Protein Peeling maximizes the number of contacts within the PUs and minimizes the number of contacts between them. This method is thus a relevant tool in the context of protein folding research, particularly regarding the hierarchical model proposed by George Rose. Here, we accurately analyze the PUs at different levels of cutting, using a non-redundant protein databank. The distribution of PU sizes, the number of PUs, and their accessibility are screened to determine their common and different features. Moreover, we highlight the preferential amino acid interactions inside and between PUs. Our results show that PUs are clearly an intermediate level between secondary structures and protein structural domains.
Affiliation(s)
- Guilhem Faure
- INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot - Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France
10
Faure G, Bornot A, de Brevern AG. Protein contacts, inter-residue interactions and side-chain modelling. Biochimie 2008; 90:626-39. [DOI: 10.1016/j.biochi.2007.11.007] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Received: 06/14/2007] [Accepted: 11/22/2007] [Indexed: 10/22/2022]
11
Ye K, Kosters WA, Ijzerman AP. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinformatics 2007; 23:687-93. [PMID: 17237070 DOI: 10.1093/bioinformatics/btl665] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. RESULTS In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. AVAILABILITY The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
Affiliation(s)
- Kai Ye
- Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research and Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands.
12
Qiu J, Elber R. SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins 2006; 62:881-91. [PMID: 16385554 DOI: 10.1002/prot.20854] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.8] [Indexed: 11/07/2022]
Abstract
In template-based modeling of protein structures, the generation of the alignment between the target and the template is a critical step that significantly affects the accuracy of the final model. This paper proposes an alignment algorithm SSALN that learns substitution matrices and position-specific gap penalties from a database of structurally aligned protein pairs. In addition to the amino acid sequence information, secondary structure and solvent accessibility information of a position are used to derive substitution scores and position-specific gap penalties. In a test set of CASP5 targets, SSALN outperforms sequence alignment methods such as a Smith-Waterman algorithm with BLOSUM50 and PSI_BLAST. SSALN also generates better alignments than PSI_BLAST in the CASP6 test set. LOOPP server prediction based on an SSALN alignment is ranked the best for target T0280_1 in CASP6. SSALN is also compared with several threading methods and sequence alignment methods on the ProSup benchmark. SSALN has the highest alignment accuracy among the methods compared. On the Fischer's benchmark, SSALN performs better than CLUSTALW and GenTHREADER, and generates more alignments with accuracy >50%, >60% or >70% than FUGUE, but fewer alignments with accuracy >80% than FUGUE. All the supplemental materials can be found at http://www.cs.cornell.edu/~jianq/research.htm.
Affiliation(s)
- Jian Qiu
- Department of Computer Science, Cornell University, Ithaca, New York 14853, USA
13
Analysis of protein homology by assessing the (dis)similarity in protein loop regions. Proteins 2005; 57:539-47. [PMID: 15382231 DOI: 10.1002/prot.20237] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
Two proteins are considered to have a similar fold if sufficiently many of their secondary structure elements are positioned similarly in space and are connected in the same order. Such a common structural scaffold may arise due to either divergent or convergent evolution. The intervening unaligned regions ("loops") between the superimposable helices and strands can exhibit a wide range of similarity and may offer clues to the structural evolution of folds. One might argue that more closely related proteins differ less in their nonconserved loop regions than distantly related proteins and, at the same time, the degree of variability in the loop regions in structurally similar but unrelated proteins is higher than in homologs. Here we introduce a new measure for structural (dis)similarity in loop regions that is based on the concept of the Hausdorff metric. This measure is used to gauge protein relatedness and is tested on a benchmark of homologous and analogous protein structures. It has been shown that the new measure can distinguish homologous from analogous proteins with the same or higher accuracy than the conventional measures that are based on comparing proteins in structurally aligned regions. We argue that this result can be attributed to the higher sensitivity of the Hausdorff (dis)similarity measure in detecting particularly evident dissimilarities in structures and draw some conclusions about evolutionary relatedness of proteins in the most populated protein folds.
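The abstract states only that the loop (dis)similarity measure is based on the concept of the Hausdorff metric. As a rough illustration of that underlying concept, here is a minimal sketch of the classical symmetric Hausdorff distance between two point sets. It assumes each loop is given as a list of (x, y, z) coordinates in a common frame; the function names and coordinates are hypothetical, and the paper's actual measure may differ in its details.

```python
from math import dist


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in points_a to its nearest point in points_b."""
    return max(min(dist(p, q) for q in points_b) for p in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Illustrative loop coordinates in Å: the loops agree except for one
# residue of loop_b that swings 3 Å away from loop_a.
loop_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
loop_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
loop_distance = hausdorff(loop_a, loop_b)
```

Because the result is set by the single worst-matched point, one strongly deviating residue dominates it, which is consistent with the abstract's remark that the measure is sensitive to particularly evident dissimilarities.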
14
Gromiha MM, Selvaraj S. Inter-residue interactions in protein folding and stability. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2004; 86:235-77. [PMID: 15288760 DOI: 10.1016/j.pbiomolbio.2003.09.003] [Citation(s) in RCA: 225] [Impact Index Per Article: 11.3] [Indexed: 12/01/2022]
Abstract
During the process of protein folding, the amino acid residues along the polypeptide chain interact with each other in a cooperative manner to form the stable native structure. The knowledge about inter-residue interactions in protein structures is very helpful to understand the mechanism of protein folding and stability. In this review, we introduce the classification of inter-residue interactions into short, medium and long range based on a simple geometric approach. The features of these interactions in different structural classes of globular and membrane proteins, and in various folds have been delineated. The development of contact potentials and the application of inter-residue contacts for predicting the structural class and secondary structures of globular proteins, solvent accessibility, fold recognition and ab initio tertiary structure prediction have been evaluated. Further, the relationship between inter-residue contacts and protein-folding rates has been highlighted. Moreover, the importance of inter-residue interactions in protein-folding kinetics and for understanding the stability of proteins has been discussed. In essence, the information gained from the studies on inter-residue interactions provides valuable insights for understanding protein folding and de novo protein design.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi Frontier Building 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan.
15
Przybylski D, Rost B. Improving Fold Recognition Without Folds. J Mol Biol 2004; 341:255-69. [PMID: 15312777 DOI: 10.1016/j.jmb.2004.05.041] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Received: 02/03/2004] [Revised: 05/18/2004] [Accepted: 05/18/2004] [Indexed: 11/21/2022]
Abstract
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
Affiliation(s)
- Dariusz Przybylski
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
16
Cherkasov A, Jones SJM. An approach to large scale identification of non-obvious structural similarities between proteins. BMC Bioinformatics 2004; 5:61. [PMID: 15147578 PMCID: PMC434491 DOI: 10.1186/1471-2105-5-61] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/02/2004] [Accepted: 05/17/2004] [Indexed: 11/13/2022] Open
Abstract
Background A new sequence independent bioinformatics approach allowing genome-wide search for proteins with similar three dimensional structures has been developed. By utilizing the numerical output of the sequence threading it establishes putative non-obvious structural similarities between proteins. When applied to the testing set of proteins with known three dimensional structures the developed approach was able to recognize structurally similar proteins with high accuracy. Results The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The developed approach utilizes the numerical outputs from the sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of the standard folds. This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity. Conclusion Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence.
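The core idea described here, correlating the threading scores of two sequences against a common fold library, can be sketched as follows. The abstract does not say which correlation statistic is used, so the Pearson coefficient below is an assumption, and the function name and score values are invented purely for illustration.

```python
from math import sqrt


def pearson(scores_x, scores_y):
    """Pearson correlation between two equally long score profiles.

    Each profile holds one threading score per fold in a shared library,
    in the same fold order. Using Pearson is an assumption of this sketch.
    """
    n = len(scores_x)
    if n < 2 or n != len(scores_y):
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores_x, scores_y))
    var_x = sum((x - mean_x) ** 2 for x in scores_x)
    var_y = sum((y - mean_y) ** 2 for y in scores_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical threading scores of three sequences against four library folds.
pathogen_protein = [1.0, 2.0, 3.0, 4.0]
host_protein = [2.0, 4.0, 6.0, 8.0]       # same preference order over folds
unrelated_protein = [4.0, 3.0, 2.0, 1.0]  # opposite preference order

similar = pearson(pathogen_protein, host_protein)
dissimilar = pearson(pathogen_protein, unrelated_protein)
```

A pair whose score profiles correlate strongly would be flagged as a candidate for structural similarity; the threshold for calling a pair similar is not given in the abstract and is left out here.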
Affiliation(s)
- Artem Cherkasov
- Division of Infectious Diseases, Department of Medicine, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Steven JM Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
17
Cherkasov A, Jones SJM. Structural characterization of genomes by large scale sequence-structure threading. BMC Bioinformatics 2004; 5:37. [PMID: 15061866 PMCID: PMC419331 DOI: 10.1186/1471-2105-5-37] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/09/2003] [Accepted: 04/03/2004] [Indexed: 12/02/2022] Open
Abstract
Background Using sequence-structure threading we have conducted structural characterization of complete proteomes of 37 archaeal, bacterial and eukaryotic organisms (including worm, fly, mouse and human) totaling 167,888 genes. Results The reported data represent the first rather general evaluation of the performance of full sequence-structure threading on multiple genomes, providing an opportunity to evaluate its general applicability for large scale studies. According to the estimated results, the sequence-structure threading has assigned protein folds to more than 60% of eukaryotic, 68% of archaeal and 70% of bacterial proteomes. The repertoires of protein classes, architectures, topologies and homologous superfamilies (according to the CATH 2.4 classification) have been established for distant organisms and superkingdoms. It has been found that the average abundance of CATH classes decreases from "alpha and beta" to "mainly beta", followed by "mainly alpha" and "few secondary structures". 3-Layer (aba) Sandwich has been characterized as the most abundant protein architecture and Rossmann fold as the most common topology. Conclusion The analysis of genomic occurrences of CATH 2.4 protein homologous superfamilies and topologies has revealed the power-law character of their distributions. The corresponding double logarithmic "frequency – genomic occurrence" dependences characteristic of scale-free systems have been established for individual organisms and for three superkingdoms. Supplementary materials to this work are available at [1].
Affiliation(s)
- Artem Cherkasov
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
- Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Steven JM Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
|
18
|
Grigoriev IV, Choi IG. Target selection for structural genomics: a single genome approach. OMICS: A Journal of Integrative Biology 2003; 6:349-62. [PMID: 12626094 DOI: 10.1089/153623102321112773] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We describe our strategy for selecting targets for protein structure determination in the context of structural genomics of a single genome. In the course of target selection, we have studied two of the smallest microbial genomes, Mycoplasma genitalium and Mycoplasma pneumoniae. To our surprise, we found that only 71 Mycoplasma genes or their orthologues can be considered easy targets for high-throughput structural studies, far fewer than expected. We discuss the methods and criteria used for target selection and the reasons explaining the rarity of easy targets. First, despite the common opinion that protein folds can be predicted for only 30-50% of genes, the number of "truly unknown" structures is less than one-third. Second, due to the different codon usage, two thirds of Mycoplasma proteins cannot be directly expressed in E. coli in a high-throughput manner and require substitution by their homologues from other organisms. Third, membrane or large multi-domain proteins are difficult targets because of solubility and size issues and often require identification and structure determination of protein domains. Finally, we propose different approaches to address the difficult targets.
Affiliation(s)
- Igor V Grigoriev
- Department of Chemistry and E.O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.
|
19
|
Bird S, Zou J, Wang T, Munday B, Cunningham C, Secombes CJ. Evolution of interleukin-1beta. Cytokine Growth Factor Rev 2002; 13:483-502. [PMID: 12401481 DOI: 10.1016/s1359-6101(02)00028-x] [Citation(s) in RCA: 203] [Impact Index Per Article: 9.2] [Indexed: 10/27/2022]
Abstract
All jawed vertebrates possess a complex immune system, which is capable of anticipatory and innate immune responses. Jawless vertebrates possess an equally complex immune system but with no evidence of an anticipatory immune response. From these findings it has been speculated that the initiation and regulation of the immune system within vertebrates will be equally complex, although very little has been done to look at the evolution of cytokine genes, despite well-known biological activities within vertebrates. In recent years, cytokines, which have been well characterised within mammals, have begun to be cloned and sequenced within non-mammalian vertebrates, with the number of cytokine sequences available from primitive vertebrates growing rapidly. The identification of cytokines, which are mammalian homologues, will give a better insight into where immune system communicators arose and may also reveal molecules, which are unique to certain organisms. Work has focussed on interleukin-1 (IL-1), a major mediator of inflammation which initiates and/or increases a wide variety of non-structural, function associated genes that are characteristically expressed during inflammation. Other than mammalian IL-1beta sequences there are now full cDNA sequences and genomic organisations available from bird, amphibian, bony fish and cartilaginous fish, with many of these genes having been obtained using an homology cloning approach. This review considers how the IL-1beta gene has changed through vertebrate evolution and whether its role and regulation are conserved within selected non-mammalian vertebrates.
Affiliation(s)
- Steve Bird
- Department of Zoology, University of Aberdeen, Tillydrone Avenue, Aberdeen AB24 2TZ, UK
|
20
|
Abstract
The protein databank contains a vast wealth of structural and functional information. The analysis of this macromolecular information has been the subject of considerable work in order to advance knowledge beyond the collection of molecular coordinates. This article presents a method that determines local structural information within proteins using mathematical data mining techniques. The mine program described returns many known configurations of residues such as the catalytic triad, metal binding sites and the N-linked glycosylation site; as well as many other multiple residue interactions not previously categorized. Because mathematical constructs are used as targets, this method can identify new information not previously known, and also provide unbiased results of typical structure and their expected deviations. Because the results are defined mathematically, they cannot indicate the biological implications of the results. Therefore two support programs are described that provide insight into the biological context for the mine results. The first allows a weighted RMSD search between a template set of coordinates and a list of PDB files, and the second allows the labeling of a protein with the template results from mining to aid in the classification of this protein.
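The weighted RMSD search described in this abstract can be pictured with a minimal sketch. The formula below is the standard weighted root-mean-square deviation; the function name `weighted_rmsd`, the per-atom weight list, and the assumption that the two coordinate sets are already superimposed and in matching order are illustrative choices, not details taken from the article.

```python
import math


def weighted_rmsd(template, candidate, weights):
    """Weighted RMSD between two pre-superimposed coordinate sets.

    template, candidate: equal-length lists of (x, y, z) tuples in the same
    atom order (assumed). weights: one non-negative weight per atom.
    """
    if not (len(template) == len(candidate) == len(weights)):
        raise ValueError("coordinate sets and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Sum of weighted squared distances between corresponding atoms.
    weighted_sq = sum(
        w * sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q, w in zip(template, candidate, weights)
    )
    return math.sqrt(weighted_sq / total_weight)
```

Raising the weight of an atom makes deviations at that position count more, which is one way a template search could emphasize functionally important residues.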
Affiliation(s)
- T J Oldfield
- Accelrys Inc., Department of Chemistry, University of York, Heslington, York, Yorkshire, United Kingdom.
|
21
|
Kumarevel TS, Gromiha MM, Selvaraj S, Gayatri K, Kumar PKR. Influence of medium- and long-range interactions in different folding types of globular proteins. Biophys Chem 2002; 99:189-98. [PMID: 12377369 DOI: 10.1016/s0301-4622(02)00183-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
Abstract
Recognition of protein fold from amino acid sequence is a challenging task. The structure and stability of proteins from different folds are mainly dictated by inter-residue interactions. In our earlier work, we successfully used medium- and long-range contacts for predicting protein folding rates, discriminating globular and membrane proteins, and distinguishing protein structural classes. In this work, we analyze the role of inter-residue interactions in commonly occurring folds of globular proteins in order to understand their folding mechanisms. In the medium-range contacts, the globin fold and four-helical bundle proteins have more contacts than the DNA-RNA fold, although they all belong to the all-alpha class. In long-range contacts, only the ribonuclease fold prefers the 4-10 range, and the other folding types prefer the range 21-30 in alpha/beta class proteins. Further, the preferred residues and residue pairs influenced by these different folds are discussed. The information about the preference of medium- and long-range contacts exhibited by the 20 amino acid residues can be effectively used to predict the folding type of each protein.
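The notion of medium- and long-range contacts can be made concrete with a short sketch that bins residue-residue contacts by sequence separation. The abstract does not state the thresholds, so everything numeric here is an assumption: an 8 Å cutoff between Cα atoms, "short" for separations of 1-2, "medium" for 3-4, and "long" for anything greater than 4. The function name `count_contacts_by_range` is hypothetical.

```python
import math


def count_contacts_by_range(ca_coords, cutoff=8.0):
    """Count residue pairs in contact, grouped by sequence separation.

    ca_coords: list of (x, y, z) C-alpha coordinates in sequence order.
    cutoff: contact distance in angstroms (assumed value).
    """
    counts = {"short": 0, "medium": 0, "long": 0}
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) > cutoff:
                continue  # not in contact
            separation = j - i
            if separation <= 2:      # assumed short-range bin
                counts["short"] += 1
            elif separation <= 4:    # assumed medium-range bin
                counts["medium"] += 1
            else:                    # assumed long-range bin
                counts["long"] += 1
    return counts
```

Finer bins, such as the 4-10 and 21-30 ranges the abstract mentions, could be obtained by recording `separation` directly instead of three coarse labels.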
Affiliation(s)
- T S Kumarevel
- National Institute of Advanced Industrial Science and Technology (AIST), Institute of Molecular and Cell Biology, Functional Nucleic Acids Group, Tsukuba Central 6, 1-1 Higashi, Tsukuba Science City, Ibaraki, Japan.
|
22
|
Abstract
Typically, protein spatial structures are more conserved in evolution than amino acid sequences. However, the recent explosion of sequence and structure information accompanied by the development of powerful computational methods led to the accumulation of examples of homologous proteins with globally distinct structures. Significant sequence conservation, local structural resemblance, and functional similarity strongly indicate evolutionary relationships between these proteins despite pronounced structural differences at the fold level. Several mechanisms such as insertions/deletions/substitutions, circular permutations, and rearrangements in beta-sheet topologies account for the majority of detected structural irregularities. The existence of evolutionarily related proteins that possess different folds brings new challenges to the homology modeling techniques and the structure classification strategies and offers new opportunities for protein design in experimental studies.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, Texas 75390-9050, USA
|
23
|
D'Alfonso G, Tramontano A, Lahm A. Structural conservation in single-domain proteins: implications for homology modeling. J Struct Biol 2001; 134:246-56. [PMID: 11551183 DOI: 10.1006/jsbi.2001.4351] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
Large-scale sequencing projects are widening the gap between the known protein universe and the fraction for which structural information has been experimentally obtained. Through the application of homology (comparative) modeling and more general structure prediction techniques, this gap can, however, be narrowed, providing indirect structural information for a considerable number of proteins. Moreover, the estimated number of existing protein folds seems to be limited and many of these yet unknown folds should be discovered by dedicated large-scale structural genomics projects. Within this perspective, homology (comparative) modeling will gain in importance, as will the use of models derived by this technique. Here we discuss how well a sequence alignment, the most common starting point for generating a model, reflects the structural conservation between homologous proteins and we show that sequence information is able to direct construction of acceptable models as far as the structural core is concerned. We also show here that the regions surrounding insertions and deletions are much less conserved than the core and discuss the implications of this observation for loop modeling.
|
24
|
Grishin NV. KH domain: one motif, two folds. Nucleic Acids Res 2001; 29:638-43. [PMID: 11160884 PMCID: PMC30387 DOI: 10.1093/nar/29.3.638] [Citation(s) in RCA: 225] [Impact Index Per Article: 9.8] [Received: 10/16/2000] [Revised: 12/01/2000] [Accepted: 12/01/2000] [Indexed: 11/14/2022] Open
Abstract
The K homology (KH) module is a widespread RNA-binding motif that has been detected by sequence similarity searches in such proteins as heterogeneous nuclear ribonucleoprotein K (hnRNP K) and ribosomal protein S3. Analysis of spatial structures of KH domains in hnRNP K and S3 reveals that they are topologically dissimilar and thus belong to different protein folds. Thus KH motif proteins provide a rare example of protein domains that share significant sequence similarity in the motif regions but possess globally distinct structures. The two distinct topologies might have arisen from an ancestral KH motif protein by N- and C-terminal extensions, or one of the existing topologies may have evolved from the other by extension, displacement and deletion. C-terminal extension (deletion) requires β-sheet rearrangement through the insertion (removal) of a β-strand in a manner similar to that observed in serine protease inhibitors serpins. Current analysis offers a new look on how proteins can change fold in the course of evolution.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA.
|
25
|
Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res 2001; 29:55-7. [PMID: 11125048 PMCID: PMC29815 DOI: 10.1093/nar/29.1.55] [Citation(s) in RCA: 148] [Impact Index Per Article: 6.4] [Indexed: 11/14/2022] Open
Abstract
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families.
Affiliation(s)
- S Dietmann
- Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK
|
26
|
Abstract
The threading approach to protein fold recognition attempts to evaluate how well a query sequence fits into an already-solved fold. 3D-1D threaders rely on matching 1-dimensional strings of 3-dimensional information predicted from the query sequence with corresponding features of the target structure. In many cases this is combined with a sequence comparison. The combination of sequence and structure information has been shown to improve the accuracy of fold recognition, relative to the exclusive use of sequence or structure. In this paper, we review progress made since the introduction of threading methods a decade ago, highlighting recent advances. We focus on two emerging methods that are unconventional 3D-1D threaders: proximity correlation matrices and parallel cascade identification.
Affiliation(s)
- R David
- Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
|
27
|
Friedberg I, Kaplan T, Margalit H. Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci 2000; 9:2278-84. [PMID: 11152139 PMCID: PMC2144484 DOI: 10.1110/ps.9.11.2278] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]
Abstract
The PSI-BLAST algorithm has been acknowledged as one of the most powerful tools for detecting remote evolutionary relationships by sequence considerations only. This has been demonstrated by its ability to recognize remote structural homologues and by the greatest coverage it enables in annotation of a complete genome. Although recognizing the correct fold of a sequence is of major importance, the accuracy of the alignment is crucial for the success of modeling one sequence by the structure of its remote homologue. Here we assess the accuracy of PSI-BLAST alignments on a stringent database of 123 structurally similar, sequence-dissimilar pairs of proteins, by comparing them to the alignments defined on a structural basis. Each protein sequence is compared to a nonredundant database of the protein sequences by PSI-BLAST. Whenever a pair member detects its pair-mate, the positions that are aligned both in the sequential and structural alignments are determined, and the alignment sensitivity is expressed as the percentage of these positions out of the structural alignment. Fifty-two sequences detected their pair-mates (for 16 pairs the success was bi-directional when either pair member was used as a query). The average percentage of correctly aligned residues per structural alignment was 43.5 ± 2.2%. Other properties of the alignments were also examined, such as the sensitivity vs. specificity and the change in these parameters over consecutive iterations. Notably, there is an improvement in alignment sensitivity over consecutive iterations, reaching an average of 50.9 ± 2.5% within the five iterations tested in the current study.
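The alignment sensitivity defined in this abstract (positions aligned identically in both the sequence and the structural alignment, as a percentage of the structural alignment) can be sketched as follows. Representing each alignment as a collection of (query position, target position) pairs is an assumption made for illustration, as is the function name `alignment_sensitivity`.

```python
def alignment_sensitivity(sequence_alignment, structural_alignment):
    """Percentage of structurally aligned position pairs that the
    sequence-based alignment reproduces exactly.

    Both arguments are iterables of (query_index, target_index) pairs
    (assumed representation).
    """
    structural = set(structural_alignment)
    if not structural:
        raise ValueError("structural alignment must not be empty")
    shared = structural & set(sequence_alignment)
    return 100.0 * len(shared) / len(structural)


# Usage: two of four structural pairs are reproduced.
structural = [(1, 1), (2, 2), (3, 3), (4, 4)]
from_sequence = [(1, 1), (2, 2), (3, 4), (5, 5)]
sensitivity = alignment_sensitivity(from_sequence, structural)
```

Dividing by the size of the sequence-based alignment instead would give a specificity-style measure, which the abstract mentions but does not define.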
Affiliation(s)
- I Friedberg
- Department of Molecular Genetics and Biotechnology, The Hebrew University, Hadassah Medical School, Jerusalem, Israel
|
28
|
Prlić A, Domingues FS, Sippl MJ. Structure-derived substitution matrices for alignment of distantly related sequences. Protein Engineering 2000; 13:545-50. [PMID: 10964983 DOI: 10.1093/protein/13.8.545] [Citation(s) in RCA: 83] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022]
Abstract
Sequence alignment is a standard method to infer evolutionary, structural, and functional relationships among sequences. The quality of alignments depends on the substitution matrix used. Here we derive matrices based on superimpositions from protein pairs of similar structure, but of low or no sequence similarity. In a performance test the matrices are compared with 12 other previously published matrices. It is found that the structure-derived matrices are applicable for comparisons of distantly related sequences. We investigate the influence of evolutionary relationships of protein pairs on the alignment accuracy.
Affiliation(s)
- A Prlić
- Center of Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob-Haringerstrasse 3, A-5020 Salzburg, Austria
|
29
|
Jung J, Lee B. Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci 2000; 9:1576-88. [PMID: 10975579 PMCID: PMC2144723 DOI: 10.1110/ps.9.8.1576] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 10/21/2022]
Abstract
Two new sets of scoring matrices are introduced: H2 for the protein sequence comparison and T2 for the protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k-residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences were aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested; the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.
Affiliation(s)
- J Jung
- Laboratory of Molecular Biology, Division of Basic Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
30
|
Grishin NV. C-terminal domains of Escherichia coli topoisomerase I belong to the zinc-ribbon superfamily. J Mol Biol 2000; 299:1165-77. [PMID: 10873443 DOI: 10.1006/jmbi.2000.3841] [Citation(s) in RCA: 45] [Impact Index Per Article: 1.9] [Indexed: 11/22/2022]
Abstract
Detection of remote evolutionary connections is increasingly difficult with sequence and structural divergence. A combination of sequence and structural analysis, in which statistically supported sequence similarity had a crucial impact, revealed that Escherichia coli topoisomerase I C-terminal fragment is evolutionarily related to the three tetracysteine zinc-binding domains of the enzyme. Spatial structure analysis of this C-terminal fragment indicates that it consists of two structurally similar domains and suggests homology between them. Sequence similarity between the zinc-binding domains of type Ia topoisomerases and transcription regulators of known spatial structure helps to conclude that E. coli topo I contains five copies of a zinc ribbon domain at the C terminus. Two of these domains, corresponding to the C-terminal fragment, lost their cysteine residues and are probably not able to bind zinc. Present analyses lead to the classification of the C-terminal fragment of E. coli topoisomerase I as a member of zinc ribbon superfamily, despite the absence of zinc-binding sites.
Affiliation(s)
- N V Grishin
- Biochemistry Department, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX, 75390-9038, USA.
|
31
|
Kelley LA, MacCallum RM, Sternberg MJ. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299:499-520. [PMID: 10860755 DOI: 10.1006/jmbi.2000.3741] [Citation(s) in RCA: 1198] [Impact Index Per Article: 49.9] [Indexed: 11/22/2022]
Abstract
A method (three-dimensional position-specific scoring matrix, 3D-PSSM) to recognise remote protein sequence homologues is described. The method combines the power of multiple sequence profiles with knowledge of protein structure to provide enhanced recognition and thus functional assignment of newly sequenced genomes. The method uses structural alignments of homologous proteins of similar three-dimensional structure in the structural classification of proteins (SCOP) database to obtain a structural equivalence of residues. These equivalences are used to extend multiply aligned sequences obtained by standard sequence searches. The resulting large superfamily-based multiple alignment is converted into a PSSM. Combined with secondary structure matching and solvation potentials, 3D-PSSM can recognise structural and functional relationships beyond state-of-the-art sequence methods. In a cross-validated benchmark on 136 homologous relationships unambiguously undetectable by position-specific iterated basic local alignment search tool (PSI-Blast), 3D-PSSM can confidently assign 18 %. The method was applied to the remaining unassigned regions of the Mycoplasma genitalium genome and an additional 13 regions were assigned with 95 % confidence. 3D-PSSM is available to the community as a web server: http://www.bmm.icnet.uk/servers/3dpssm
Affiliation(s)
- L A Kelley
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England
|
32
|
Abstract
A number of recent advances have been made in deriving function information from protein structure. A fold relationship to an already characterized protein will often allow general information about function to be deduced. More detailed information can be obtained using sequence relationships to already studied proteins. Methods of deducing function directly from structure, without the use of evolutionary relationships, are developing rapidly. All such methods may be used with models of protein structure, rather than with experimentally determined ones, but model accuracy imposes limitations. The rapid expansion of the structural genomics field has created a new urgency for improved methods of structure-based annotation of function.
Affiliation(s)
- J Moult
- Center for Advanced Research in Biotechnology, University of Maryland, Biotechnology Institute, Rockville, MD 20850, USA.
|
33
|
Grishin NV. Two tricks in one bundle: helix-turn-helix gains enzymatic activity. Nucleic Acids Res 2000; 28:2229-33. [PMID: 10871343 PMCID: PMC102627 DOI: 10.1093/nar/28.11.2229] [Citation(s) in RCA: 21] [Impact Index Per Article: 0.9] [Received: 02/22/2000] [Revised: 04/04/2000] [Accepted: 04/04/2000] [Indexed: 11/13/2022] Open
Abstract
Many examples of enzymes that have lost their catalytic activity and perform other biological functions are known. The opposite situation is rare. A previously unnoticed structural similarity between the lambda integrase family (Int) proteins and the AraC family of transcriptional activators implies that the Int family evolved by duplication of an ancient DNA-binding homeodomain-like module, which acquired enzymatic activity. The two helix-turn-helix (HTH) motifs in Int proteins incorporate catalytic residues and participate in DNA binding. The active site of Int proteins, which include the type IB topoisomerases, is formed at the domain interface and the catalytic tyrosine residue is located in the second helix of the C-terminal HTH motif. Structural analysis of other 'tyrosine' DNA-breaking/rejoining enzymes with similar enzyme mechanisms, namely prokaryotic topoisomerase I, topoisomerase II and archaeal topoisomerase VI, reveals that the catalytic tyrosine is placed in a HTH domain as well. Surprisingly, the location of this tyrosine residue in the structure is not conserved, suggesting independent, parallel evolution leading to the same catalytic function by homologous HTH domains. The 'tyrosine' recombinases give a rare example of enzymes that evolved from ancient DNA-binding modules and present a unique case for homologous enzymatic domains with similar catalytic mechanisms but different locations of catalytic residues, which are placed at non-homologous sites.
Affiliation(s)
- N V Grishin
- Biochemistry Department, University of Texas Southwestern Medical Center, Dallas, TX 75390-9038, USA.
|
34
|
Domingues FS, Lackner P, Andreeva A, Sippl MJ. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J Mol Biol 2000; 297:1003-13. [PMID: 10736233 DOI: 10.1006/jmbi.2000.3615] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.0] [Indexed: 11/22/2022]
Abstract
The biological role, biochemical function, and structure of uncharacterized protein sequences is often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy. Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity. Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure derived alignments. Using this benchmark we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.
Affiliation(s)
- F S Domingues
- Center for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob Haringer Strasse 3, Salzburg, A-5020, Austria
|
35
|
Panchenko AR, Marchler-Bauer A, Bryant SH. Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 2000; 296:1319-31. [PMID: 10698636 DOI: 10.1006/jmbi.2000.3541] [Citation(s) in RCA: 102] [Impact Index Per Article: 4.3] [Indexed: 12/13/2022]
Abstract
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10 % and 20 % sequence identity and 60 % to 80 % superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.
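The two ingredients named in this abstract, a log-odds contact potential and a combined scoring function, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' scoring function: the count dictionaries, the pseudocount, the linear mixing rule, and the `contact_weight` parameter are all hypothetical.

```python
import math


def contact_log_odds(native_counts, reference_counts, pseudocount=1.0):
    """Log-odds of each residue-pair contact in native structures relative
    to a reference set (e.g. randomly compacted structures).

    Both arguments map a residue pair such as ("L", "V") to an observed
    contact count (assumed input format).
    """
    pairs = set(native_counts) | set(reference_counts)
    native_total = sum(native_counts.get(p, 0) + pseudocount for p in pairs)
    reference_total = sum(reference_counts.get(p, 0) + pseudocount for p in pairs)
    potential = {}
    for p in pairs:
        native_freq = (native_counts.get(p, 0) + pseudocount) / native_total
        reference_freq = (reference_counts.get(p, 0) + pseudocount) / reference_total
        potential[p] = math.log(native_freq / reference_freq)
    return potential


def combined_score(contact_term, profile_term, contact_weight=0.5):
    """Linear mix of a contact-potential term and a sequence-profile term.

    The linear form and the default weight are assumptions; the abstract
    only states that an optimal combination was sought.
    """
    return contact_weight * contact_term + (1.0 - contact_weight) * profile_term
```

Positive log-odds mark pairs that are in contact more often in native structures than in the reference set; sweeping `contact_weight` over a benchmark is one simple way to look for a good balance between the two terms.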
Affiliation(s)
- A R Panchenko
- National Center for Biotechnology Information, National Institutes of Health, Building 38A, Room 8N805, Bethesda, MD 20894, USA
|
36
|
Grigoriev IV, Kim SH. Detection of protein fold similarity based on correlation of amino acid properties. Proc Natl Acad Sci U S A 1999; 96:14318-23. [PMID: 10588703 PMCID: PMC24434 DOI: 10.1073/pnas.96.25.14318] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.1] [Indexed: 11/18/2022] Open
Abstract
An increasing number of proteins with weak sequence similarity have been found to assume a similar three-dimensional fold and often have similar or related biochemical or biophysical functions. We propose a method for detecting the fold similarity between two proteins with low sequence similarity based on their amino acid properties alone. The method, the proximity correlation matrix (PCM) method, is built on the observation that the physical properties of neighboring amino acid residues in sequence at structurally equivalent positions of two proteins of similar fold are often correlated even when amino acid sequences are different. Hydrophobicity is shown to be the most strongly correlated property for all protein fold classes. The PCM method was tested on 420 proteins belonging to 64 different known folds, each having at least three proteins with little sequence similarity. The method was able to detect fold similarities for 40% of the 420 sequences. Compared with sequence comparison and several fold-recognition methods, the method demonstrates good performance in detecting fold similarities among the proteins with low sequence identity. Applied to the complete genome of Methanococcus jannaschii, the method recognized the folds for 22 hypothetical proteins.
Affiliation(s)
- I V Grigoriev
- Department of Chemistry and E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA
|
37
|
Gromiha MM, Selvaraj S. Influence of medium and long range interactions in protein folding. Prep Biochem Biotechnol 1999; 29:339-51. [PMID: 10548251 DOI: 10.1080/10826069908544933] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/22/2022]
Abstract
Protein structures are stabilized by both local and long-range interactions. In this work, we analyze the residue-residue contacts and the role of medium- and long-range interactions in globular proteins belonging to different structural classes. The results show that while medium-range interactions predominate in all-alpha class proteins, long-range interactions predominate in the all-beta class. Based on this, we analyze the performance of several structure prediction methods in different structural classes of globular proteins and find that all the methods predict the secondary structures of all-alpha proteins more accurately than those of other classes. We also observe that residues 21-30 positions apart in sequence contribute more towards long-range contacts and that about 85% of residues are involved in long-range contacts. Further, the preference of residue pairs to the folding and stability of globular proteins is discussed.
Affiliation(s)
- M M Gromiha
- RIKEN Life Science Center, The Institute of Physical and Chemical Research, Tsukuba, Ibaraki, Japan
|
38
|
Abstract
BACKGROUND: Several methods of structural classification have been developed to introduce some order to the large amount of data present in the Protein Data Bank. Such methods facilitate structural comparisons and provide a greater understanding of structure and function. The most widely used and comprehensive databases are SCOP, CATH and FSSP, which represent three unique methods of classifying protein structures: purely manual, a combination of manual and automated, and purely automated, respectively. In order to develop reliable template libraries and benchmarks for protein-fold recognition, a systematic comparison of these databases has been carried out to determine their overall agreement in classifying protein structures. RESULTS: Approximately two-thirds of the protein chains in each database are common to all three databases. Despite employing different methods, and basing their systems on different rules of protein structure and taxonomy, SCOP, CATH and FSSP agree on the majority of their classifications. Discrepancies and inconsistencies are accounted for by a small number of explanations. Other interesting features have been identified, and various differences between manual and automatic classification methods are presented. CONCLUSIONS: Using these databases requires an understanding of the rules upon which they are based; each method offers certain advantages depending on the biological requirements and knowledge of the user. The degree of discrepancy between the systems also has an impact on reliability of prediction methods that employ these schemes as benchmarks. To generate accurate fold templates for threading, we extract information from a consensus database, encompassing agreements between SCOP, CATH and FSSP.
Affiliation(s)
- C Hadley
- Protein Structure Group, Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, UK
|
39
|
Geetha V, Di Francesco V, Garnier J, Munson PJ. Comparing protein sequence-based and predicted secondary structure-based methods for identification of remote homologs. Protein Engineering 1999; 12:527-34. [PMID: 10436078 DOI: 10.1093/protein/12.7.527] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
We have compared a novel sequence-structure matching technique, FORESST, for detecting remote homologs to three existing sequence-based methods: local amino acid sequence similarity by BLASTP, hidden Markov models (HMMs) of sequences of protein families using SAM, and HMMs based on sequence motifs identified using meta-MEME. FORESST compares predicted secondary structures to a library of structural families of proteins, using HMMs. Altogether 45 proteins from nine structural families in the database CATH were used in a cross-validated test of the fold assignment accuracy of each method. Local sequence similarity of a query sequence to a protein family is measured by the highest segment pair (HSP) score. Each of the HMM-based approaches (FORESST, MEME, amino acid sequence-based HMM) yielded a log-odds score for the query sequence. In order to make a fair comparison among these methods, the scores for each method were converted to Z-scores in a uniform way by comparing the raw scores of a query protein with the corresponding scores for a set of unrelated proteins. Z-scores were analyzed as a function of the maximum pairwise sequence identity (MPSID) of the query sequence to sequences used in training the model. For MPSID above 20%, the Z-scores increase linearly with MPSID for the sequence-based methods but remain roughly constant for FORESST. Below 15%, average Z-scores are close to zero for the sequence-based methods, whereas the FORESST method yielded average Z-scores of 1.8 and 1.1, using observed and predicted secondary structures, respectively. This demonstrates the advantage of the sequence-structure method for detecting remote homologs.
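The Z-score conversion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the conventional definition (the query's raw score minus the mean of the scores for unrelated proteins, divided by their sample standard deviation); the abstract does not state the exact formula, and the function name `z_score` and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def z_score(query_score, background_scores):
    """Express a raw score in units of standard deviations above the
    mean score of a background set (here: unrelated proteins).

    Assumption: sample standard deviation is used; the source abstract
    does not specify which estimator was applied.
    """
    if len(background_scores) < 2:
        raise ValueError("need at least two background scores")
    mu = mean(background_scores)
    sigma = stdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (query_score - mu) / sigma


# Hypothetical raw scores from one method: one query, three unrelated proteins.
background = [1.0, 2.0, 3.0]
print(z_score(5.0, background))
```

Because every method's raw score is rescaled against its own background distribution, scores from different methods (HSP scores, HMM log-odds scores) end up on a common scale.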
Affiliation(s)
- V Geetha
- ABS/MSCL/CIT, National Institutes of Health, Bethesda, MD 20892, USA
|
40
|
Villoutreix BO, Blom AM, Webb J, Dahlbäck B. The complement regulator C4b-binding protein analyzed by molecular modeling, bioinformatics and computer-aided experimental design. Immunopharmacology 1999; 42:121-34. [PMID: 10408373 DOI: 10.1016/s0162-3109(99)00022-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Molecular modeling and bioinformatics have gained recognition as scientific disciplines of importance in the field of biomedical research. Molecular modeling not only allows prediction of the three-dimensional structure of a protein but also helps to define its function. Careful incorporation of the experimental findings in the structural/theoretical data provides a means to understand molecular mechanisms for highly complex biological systems. C4b-binding protein (C4BP) is composed of one beta-chain and seven alpha-chains essentially built from three and eight complement control protein (CCP) modules, respectively, followed by a non-repeat carboxy-terminal region involved in polymerization of the chains. C4BP is involved in the regulation of the complement system and interacts with many molecules such as C4b, Arp, protein S and heparin. Here, we report experimental and computer data obtained for C4BP. Protein modeling together with site-directed mutagenesis indicates that R39, R64 and R66 from the C4BP alpha-chain form a key binding site for heparin, suggesting that this region could be of major importance for interaction with C4b. We also propose that the first CCP of the C4BP beta-chain displays a key hydrophobic surface of major importance for the interaction with the coagulation cofactor protein S.
Affiliation(s)
- B O Villoutreix
- Lund University, The Wallenberg Laboratory, Department of Clinical Chemistry, University Hospital, Malmö, Sweden.
|
41
|
Ayers DJ, Gooley PR, Widmer-Cooper A, Torda AE. Enhanced protein fold recognition using secondary structure information from NMR. Protein Sci 1999; 8:1127-33. [PMID: 10338023 PMCID: PMC2144327 DOI: 10.1110/ps.8.5.1127] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
Abstract
NMR offers the possibility of accurate secondary structure for proteins that would be too large for structure determination. In the absence of an X-ray crystal structure, this information should be useful as an adjunct to protein fold recognition methods based on low resolution force fields. The value of this information has been tested by adding varying amounts of artificial secondary structure data and threading a sequence through a library of candidate folds. Using a literature test set, the threading method alone has only a one-third chance of producing a correct answer among the top ten guesses. With realistic secondary structure information, one can expect a 60-80% chance of finding a homologous structure. The method has then been applied to examples with published estimates of secondary structure. This implementation is completely independent of sequence homology, and sequences are optimally aligned to candidate structures with gaps and insertions allowed. Unlike work using predicted secondary structure, we test the effect of differing amounts of relatively reliable data.
Affiliation(s)
- D J Ayers
- Research School of Chemistry, Australian National University, Canberra ACT
|
42
|
de la Cruz X, Thornton JM. Factors limiting the performance of prediction-based fold recognition methods. Protein Sci 1999; 8:750-9. [PMID: 10211821 PMCID: PMC2144320 DOI: 10.1110/ps.8.4.750] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/19/2022]
Abstract
In the past few years, a new generation of fold recognition methods has been developed, in which the classical sequence information is combined with information obtained from secondary structure and, sometimes, accessibility predictions. The results are promising, indicating that this approach may compete with potential-based methods (Rost B et al., 1997, J Mol Biol 270:471-480). Here we present a systematic study of the different factors contributing to the performance of these methods, in particular when applied to the problem of fold recognition of remote homologues. Our results indicate that secondary structure and accessibility prediction methods have reached an accuracy level where they are not the major factor limiting the accuracy of fold recognition. The pattern degeneracy problem is confirmed as the major source of error of these methods. On the basis of these results, we study three different options to overcome these limitations: normalization schemes, mapping of the coil state into the different zones of the Ramachandran plot, and post-threading graphical analysis.
Affiliation(s)
- X de la Cruz
- Department of Biochemistry and Molecular Biology, University College, London, United Kingdom
|
43
|
Abstract
Long-range interactions play an active role in the stability of protein molecules. In this work, we have analyzed the importance of long-range interactions in different structural classes of globular proteins in terms of residue distances. We found that 85% of residues are involved in long-range contacts. The residues occurring in the range of 4-10 residues apart contribute more towards long-range contacts in all-alpha proteins while the range is 11-20 in all-beta proteins. The hydrophobic residues Cys, Ile and Val prefer the 11-20 range and all other residues prefer the 4-10 range. The residues in all-beta proteins have an average of 3-8 long-range contacts whereas the residues in other classes have 1-4 long-range contacts. Furthermore, the preference of residue pairs to the folding and stability will be discussed.
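The binning of contacts by sequence separation described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the 8 Å distance cutoff, the use of one representative coordinate per residue, and the function name `contact_separation_histogram` are all hypothetical. Only the separation ranges 4-10 and 11-20 come from the abstract.

```python
import math
from collections import Counter


def contact_separation_histogram(coords, cutoff=8.0, bins=((4, 10), (11, 20))):
    """Count residue pairs in spatial contact, grouped by how far apart
    the two residues are along the sequence.

    coords: one (x, y, z) tuple per residue, in sequence order.
    cutoff: distance threshold in angstroms (assumed value).
    bins:   inclusive sequence-separation ranges to tally.
    """
    counts = Counter()
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            separation = j - i
            if math.dist(coords[i], coords[j]) > cutoff:
                continue  # not in spatial contact
            for low, high in bins:
                if low <= separation <= high:
                    counts[(low, high)] += 1
                    break
    return counts


# Toy chain: residues 0 and 5 are close in space but 5 apart in sequence.
toy = [(0, 0, 0), (100, 0, 0), (200, 0, 0), (300, 0, 0), (400, 0, 0), (1, 0, 0)]
print(contact_separation_histogram(toy))
```

The double loop is quadratic in chain length, which is adequate for single proteins; a spatial grid or k-d tree would be the usual choice for larger systems.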
Affiliation(s)
- M M Gromiha
- Institute of Physical and Chemical Research (RIKEN), Tsukuba Life Science Center, Ibaraki, Japan.
|
44
|
|
45
|
Russell RB, Sasieni PD, Sternberg MJ. Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998; 282:903-18. [PMID: 9743635 DOI: 10.1006/jmbi.1998.2043] [Citation(s) in RCA: 162] [Impact Index Per Article: 6.2] [Indexed: 01/08/2023]
Abstract
A method is presented to assess the significance of binding site similarities within superimposed protein three-dimensional (3D) structures and applied to all similar structures in the Protein Data Bank. For similarities between 3D structures lacking significant sequence similarity, the important distinction was made between remote homology (an ancient common ancestor) and analogy (likely convergence to a folding motif) according to the structural classification of proteins (SCOP) database. Supersites were defined as structural locations on groups of analogous proteins (i.e. superfolds) showing a statistically significant tendency to bind substrates despite little evidence of a common ancestor for the proteins considered. We identify three potentially new superfolds containing supersites: ferredoxin-like folds, four-helical bundles and double-stranded beta helices. In addition, the method quantifies binding site similarities within homologous proteins and previously identified supersites such as that found in the beta/alpha (TIM) barrels. For the nine superfolds, the accuracy of predictions of binding site locations is assessed. Implications for protein evolution, and the prediction of protein function either through fold recognition or tertiary structure comparison, are discussed.
Affiliation(s)
- R B Russell
- Biomolecular Modelling Laboratory, Lincoln's Inn Fields, PO Box 123, London WC2A 3PX, UK
|
46
|
Turcotte M, Muggleton SH, Sternberg MJE. Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. Inductive Logic Programming 1998. [DOI: 10.1007/bfb0027310] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.7] [Indexed: 02/08/2023]
|