1
Li W, Kinch LN, Karplus PA, Grishin NV. ChSeq: A database of chameleon sequences. Protein Sci 2015; 24:1075-86. [PMID: 25970262 DOI: 10.1002/pro.2689] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 02/20/2015] [Revised: 04/15/2015] [Accepted: 04/24/2015] [Indexed: 11/11/2022]
Abstract
Chameleon sequences (ChSeqs) refer to sequence strings of identical amino acids that can adopt different conformations in protein structures. Researchers have detected and studied ChSeqs to understand the interplay between local and global interactions in protein structure formation. The different secondary structures adopted by one ChSeq challenge sequence-based secondary structure predictors. With increasing numbers of available Protein Data Bank structures, we here identify a large set of ChSeqs ranging from 6 to 10 residues in length. The homologous ChSeqs discovered highlight the structural plasticity involved in biological function. When compared with previous studies, the set of unrelated ChSeqs found represents an approximately 20-fold increase in the number of detected sequences, as well as an increase in the longest ChSeq length from 8 to 10 residues. We applied secondary structure predictors to our ChSeqs and found that methods based on a sequence profile outperformed methods based on a single sequence. For the unrelated ChSeqs, the evolutionary information provided by the sequence profile typically allows successful prediction of the prevailing secondary structure adopted in each protein family. Our dataset will facilitate future studies of ChSeqs, as well as interpretations of the interplay between local and nonlocal interactions. A user-friendly web interface for this ChSeq database is available at prodata.swmed.edu/chseq.
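To make the idea of a chameleon sequence concrete, the following is a minimal sketch and not the ChSeq pipeline itself. It assumes each chain is given as an aligned pair of strings (amino-acid sequence and per-residue secondary structure, with `H` for helix and `E` for strand), and it assumes a deliberately strict criterion: a window counts only if it is entirely helix in one chain and entirely strand in another. The function name `find_chameleons` and the toy chains are hypothetical.

```python
from collections import defaultdict


def find_chameleons(chains, min_len=6, max_len=10):
    """Return k-mers (min_len..max_len) seen as all-helix in one place
    and all-strand in another.

    chains: iterable of (sequence, secondary_structure) string pairs,
            aligned residue by residue. The strict all-H / all-E rule
            is an illustrative assumption, not the published criterion.
    """
    seen = defaultdict(set)  # k-mer -> set of uniform states observed
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and secondary structure must align")
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                window = ss[i:i + k]
                if window == "H" * k:
                    seen[seq[i:i + k]].add("H")
                elif window == "E" * k:
                    seen[seq[i:i + k]].add("E")
    return sorted(kmer for kmer, states in seen.items() if states == {"H", "E"})


# Toy data (invented for illustration): the same stretch is helical in the
# first chain and forms a strand in the second.
chains = [
    ("MKVLAAGIVGLSAA", "CHHHHHHHHHHCCC"),
    ("TTVLAAGIVKQ", "CEEEEEEEECC"),
]
print(find_chameleons(chains))
```

Shorter hits nested inside a longer one are reported separately here; a real analysis would likely keep only the longest match and filter homologous chains, which this sketch does not attempt.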
Affiliation(s)
- Wenlin Li
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
- Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
- P Andrew Karplus
- Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331
- Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050; Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
2
Abstract
Motivation: Subcellular localization is one aspect of protein function. Despite advances in high-throughput imaging, localization maps remain incomplete. Several methods accurately predict localization, but many challenges remain to be tackled. Results: In this study, we introduced a framework to predict localization in life's three domains, including globular and membrane proteins (3 classes for archaea; 6 for bacteria and 18 for eukaryota). The resulting method, LocTree2, works well even for protein fragments. It uses a hierarchical system of support vector machines that imitates the cascading mechanism of cellular sorting. The method reaches high levels of sustained performance (eukaryota: Q18=65%, bacteria: Q6=84%). LocTree2 also accurately distinguishes membrane and non-membrane proteins. In our hands, it compared favorably with top methods when tested on new data. Availability: Online through PredictProtein (predictprotein.org); as standalone version at http://www.rostlab.org/services/loctree2. Contact: localization@rostlab.org. Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tatyana Goldberg
- TUM, Bioinformatik-I12, Informatik, Boltzmannstrasse 3, Garching 85748, Germany.
3
Ding W, Xie J, Dai D, Zhang H, Xie H, Zhang W. CNNcon: improved protein contact maps prediction using cascaded neural networks. PLoS One 2013; 8:e61533. [PMID: 23626696 PMCID: PMC3634008 DOI: 10.1371/journal.pone.0061533] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 12/17/2012] [Accepted: 03/11/2013] [Indexed: 11/18/2022]
Abstract
BACKGROUND Despite continuing progress in X-ray crystallography and high-field NMR spectroscopy for determination of three-dimensional protein structures, the number of unsolved and newly discovered sequences grows much faster than that of determined structures. Protein modeling methods can possibly bridge this huge sequence-structure gap with the development of computational science. A grand challenge is to predict three-dimensional protein structure from its primary structure (residue sequence) alone. However, predicting residue contact maps is a crucial and promising intermediate step towards final three-dimensional structure prediction. Better predictions of local and non-local contacts between residues can transform protein sequence alignment to structure alignment, which can finally improve template-based three-dimensional protein structure predictors greatly. METHODS CNNcon, an improved multiple neural networks based contact map predictor using six sub-networks and one final cascade-network, was developed in this paper. Both the sub-networks and the final cascade-network were trained and tested with their corresponding data sets. For testing, the target protein was first coded and then input to its corresponding sub-networks for prediction. After that, the intermediate results were input to the cascade-network to finish the final prediction. RESULTS CNNcon can accurately predict on average 58.86% of contacts at a distance cutoff of 8 Å for proteins with lengths ranging from 51 to 450. The comparison results show that the present method performs better than the compared state-of-the-art predictors. In particular, the prediction accuracy remains steady as protein sequence length increases. This indicates that CNNcon overcomes the thin density problem, with which other current predictors have trouble. This advantage makes the method valuable for the prediction of long proteins.
As a result, CNNcon could make effective prediction of long proteins possible.
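For readers unfamiliar with contact maps, here is a minimal sketch of how a binary map at the 8 Å cutoff mentioned above can be derived from coordinates, together with the fraction of predicted pairs that are real contacts. It is not CNNcon's code. The choice of one representative atom per residue, the use of `<=` at the cutoff, and the names `contact_map` and `contact_accuracy` are assumptions made for illustration.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Binary contact map from one (x, y, z) point per residue.

    Two residues i < j are in contact when their distance is <= cutoff
    and j - i >= min_separation. Which atom represents a residue
    (e.g. C-alpha or C-beta) is left to the caller.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def contact_accuracy(predicted_pairs, cmap):
    """Fraction of predicted residue pairs that are true contacts."""
    if not predicted_pairs:
        return 0.0
    hits = sum(cmap[i][j] for i, j in predicted_pairs)
    return hits / len(predicted_pairs)


# Toy chain: four residues on a straight line, 3.8 Å apart (invented data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
print(cmap[0][2], cmap[0][3])                    # 7.6 Å apart vs 11.4 Å apart
print(contact_accuracy([(0, 2), (0, 3)], cmap))  # one of two predictions is right
```

Raising `min_separation` restricts the map to pairs far apart in sequence, which is how long-range contacts are usually singled out; the threshold to use for that is not stated in the abstract.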
Affiliation(s)
- Wang Ding
- School of Computer Engineering and Science, Shanghai University, Shanghai, People’s Republic of China
- Jiang Xie
- School of Computer Engineering and Science, Shanghai University, Shanghai, People’s Republic of China
- Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
- Department of Mathematics, University of California Irvine, Irvine, California, United States of America
- Dongbo Dai
- School of Computer Engineering and Science, Shanghai University, Shanghai, People’s Republic of China
- Huiran Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, People’s Republic of China
- Hao Xie
- College of Stomatology, Wuhan University, Wuhan, People’s Republic of China
- Wu Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, People’s Republic of China
- Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
4
Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 2013; 14 Suppl 3:S7. [PMID: 23514582 PMCID: PMC3584931 DOI: 10.1186/1471-2105-14-s3-s7] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference. METHODS Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements. RESULTS AND CONCLUSIONS During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.
Affiliation(s)
- Tobias Hamp
- TUM, Department of Informatics, Bioinformatics & Computational Biology - I12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
5
Yan J, Marcus M, Kurgan L. Comprehensively designed consensus of standalone secondary structure predictors improves Q3 by over 3%. J Biomol Struct Dyn 2013; 32:36-51. [PMID: 23298369 DOI: 10.1080/07391102.2012.746945] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/20/2022]
Abstract
Protein fold is defined by a spatial arrangement of three types of secondary structures (SSs) including helices, sheets, and coils/loops. Current methods that predict SS from sequences rely on complex machine learning-derived models and provide a three-state accuracy (Q3) of about 82%. Further improvements in predictive quality could be obtained with a consensus-based approach, which has so far received limited attention. We perform a first-of-its-kind comprehensive design of an SS consensus predictor (SScon), in which we consider 12 modern standalone SS predictors and utilize a Support Vector Machine (SVM) to combine their predictions. Using a large benchmark data-set with 10 random training-test splits, we show that a simple, voting-based consensus of carefully selected base methods improves Q3 by 1.9% when compared to the best single predictor. Use of SVM provides an additional 1.4% improvement with the overall Q3 at 85.6% and segment overlap (SOV3) at 83.7%, when compared to 82.3% and 80.9%, respectively, obtained by the best individual methods. We also show strong improvements when the consensus is based on ab-initio methods, with Q3 = 82.3% and SOV3 = 80.7% that match the results from the best template-based approaches. Our consensus reduces the number of significant errors where helix is confused with a strand, provides particularly good results for short helices and strands, and gives the most accurate estimates of the content of individual SSs in the chain. Case studies are used to visualize the improvements offered by the consensus at the residue level. A web-server and a standalone implementation of SScon are available at http://biomine.ece.ualberta.ca/SSCon/ .
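The two quantities at the heart of this abstract, Q3 and a voting-based consensus, are simple enough to sketch. The code below is an illustration only: it covers the plain majority-vote baseline, not the SVM combination that SScon uses, and the tie-breaking rule (prefer the state proposed by the earliest predictor in the list) is an assumption. The function names and toy strings are hypothetical.

```python
from collections import Counter


def q3(predicted, observed):
    """Three-state accuracy: percentage of residues whose predicted state
    (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def majority_vote(predictions):
    """Per-residue majority vote over several equal-length predictions.

    Ties go to the tied state that appears first in predictor order
    (an arbitrary, illustrative choice).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have equal length")
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        consensus.append(next(s for s in column if counts[s] == top))
    return "".join(consensus)


# Invented toy example: each base predictor errs at a different position.
observed = "HHHHCCEEEE"
predictions = ["HHHCCCEEEE", "HHHHCCCEEE", "HHHHHCEEEC"]
consensus = majority_vote(predictions)
print([q3(p, observed) for p in predictions])  # [90.0, 90.0, 80.0]
print(consensus, q3(consensus, observed))      # HHHHCCEEEE 100.0
```

The toy case is constructed so that errors do not overlap, which is the best case for voting; with correlated errors the gain shrinks, consistent with the abstract's emphasis on carefully selecting the base methods.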
Affiliation(s)
- Jing Yan
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
6
PSS-3D1D: an improved 3D1D profile method of protein fold recognition for the annotation of twilight zone sequences. J Struct Funct Genomics 2011; 12:181-9. [DOI: 10.1007/s10969-011-9119-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/15/2011] [Accepted: 11/24/2011] [Indexed: 10/14/2022]
7
Wrzeszczynski KO, Rost B. Cell cycle kinases predicted from conserved biophysical properties. Proteins 2009; 74:655-68. [PMID: 18704950 DOI: 10.1002/prot.22181] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine-learning techniques can classify functionally related proteins where homology-transfer as well as sequence and structure motifs fail. Here, we present a method aimed at complementing homology-transfer in the identification of cell cycle control kinases from sequence alone. First, we identified functionally significant residues in cell cycle proteins through their high sequence conservation and biophysical properties. We then incorporated these residues and their features into support vector machines (SVM) to identify new kinases and more specifically to differentiate cell cycle kinases from other kinases and other proteins. As expected, the most informative residues tend to be highly conserved and tend to localize in the ATP binding regions of the kinases. Another observation confirmed that ATP binding regions are typically not found on the surface but in partially buried sites, and that this fact is correctly captured by accessibility predictions. Using these highly conserved, semi-buried residues and their biophysical properties, we could distinguish cell cycle S/T kinases from other kinase families at levels around 70-80% accuracy and 62-81% coverage. An application to the entire human proteome predicted at least 97 human proteins with limited previous annotations to be candidates for cell cycle kinases.
Affiliation(s)
- Kazimierz O Wrzeszczynski
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
8
Swanson R, Kagiampakis I, Tsai JW. An information measure of the quality of protein secondary structure prediction. J Comput Biol 2008; 15:65-79. [PMID: 18199024 DOI: 10.1089/cmb.2007.0199] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
We describe an information-theory-based measure of the quality of secondary structure prediction (RELINFO). RELINFO has a simple yet intuitive interpretation: it represents the factor by which secondary structure choice at a residue has been restricted by a prediction scheme. As an alternative interpretation of secondary structure prediction, RELINFO complements currently used methods by providing an information-based view as to why a prediction succeeds or fails. To demonstrate this score's capabilities, we applied RELINFO to an analysis of a large set of secondary structure predictions obtained from the first five rounds of the Critical Assessment of Structure Prediction (CASP) experiment. RELINFO is compared with two other common measures: percent correct (Q3) and secondary structure overlap (SOV). While the correlation between Q3 and RELINFO is approximately 0.85, RELINFO avoids certain disadvantages of Q3, including overestimating the quality of a prediction. The correlation between SOV and RELINFO is approximately 0.75. The valuable SOV measure unfortunately suffers from a saturation problem, and perhaps has unfairly given the general impression that secondary structure prediction has reached its limit since SOV hasn't improved much over the recent rounds of CASP. Although not a replacement for SOV, RELINFO has greater dispersion. Over the five rounds of CASP assessed here, RELINFO shows that prediction targets have been more difficult in successive CASP experiments, yet prediction quality has continued to improve measurably over each round. In terms of information, the secondary structure prediction quality has almost doubled from CASP1 to CASP5. Therefore, as a different perspective of accuracy, RELINFO can help to improve prediction of protein secondary structure by providing a measure of difficulty as well as final quality of a prediction.
Affiliation(s)
- Rosemarie Swanson
- Department of Biochemistry and Biophysics, Texas A&M University, Texas Agricultural Experiment Station, College Station, Texas 77843-2128, USA.
9
Abstract
Is there any reason why we should predict contact maps (CMs)? The question is one of the several 'NP-hard' questions that arise when striving for feasible solutions of the protein folding problem. At some point, theoreticians started thinking that a possible alternative to an unsolvable problem was to predict a simplified version of the protein structure: a CM. In this chapter, we will clarify that whenever problems are difficult they remain at least as difficult in the process of finding approximate solutions or heuristic approaches. However, humans rarely give up, as it is stimulating to find solutions in the face of difficulties. CMs of proteins are an interesting and useful representation of protein structures. These two-dimensional representations capture all the important features of a protein fold. We will review the general characteristics of CMs and the methods developed to study and predict them, and we will highlight some new ideas on how to improve CM predictions.
10
Homaeian L, Kurgan LA, Ruan J, Cios KJ, Chen K. Prediction of protein secondary structure content for the twilight zone sequences. Proteins 2007; 69:486-98. [PMID: 17623861 DOI: 10.1002/prot.21527] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/06/2022]
Abstract
Secondary protein structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. A significant majority of successful methods for predicting secondary structure are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. The PSSC-core is shown to provide statistically significantly better results when compared with the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods.
At the same time, it also provides useful insight into the design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure.
Affiliation(s)
- Leila Homaeian
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
11
Graña O, Baker D, MacCallum RM, Meiler J, Punta M, Rost B, Tress ML, Valencia A. CASP6 assessment of contact prediction. Proteins 2006; 61 Suppl 7:214-224. [PMID: 16187364 DOI: 10.1002/prot.20739] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/11/2022]
Abstract
Here we present the evaluation results of the Critical Assessment of Protein Structure Prediction (CASP6) contact prediction category. Contact prediction was assessed with standard measures well known in the field and the performance of specialist groups was evaluated alongside groups that submitted models with 3D coordinates. The evaluation was mainly focused on long range contact predictions for the set of new fold targets, although we analyzed predictions for all targets. Three groups with similar levels of accuracy and coverage performed a little better than the others. Comparisons of the predictions of the three best methods with those of CASP5/CAFASP3 suggested some improvement, although there were not enough targets in the comparisons to make this statistically significant.
Affiliation(s)
- Osvaldo Graña
- Protein Design Group, Centro Nacional de Biotecnologia (CNB-CSIC), C/Darwin 3, Cantoblanco, Madrid, Spain
12
Ferré S, King RD. Finding Motifs in Protein Secondary Structure for Use in Function Prediction. J Comput Biol 2006; 13:719-31. [PMID: 16706721 DOI: 10.1089/cmb.2006.13.719] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
This paper presents a novel algorithm for the discovery of biological sequence motifs. Our motivation is the prediction of gene function. We seek to discover motifs and combinations of motifs in the secondary structure of proteins for application to the understanding and prediction of functional classes. The motifs found by our algorithm allow both flexible length structural elements and flexible length gaps and can be of arbitrary length. The algorithm is based on neither top-down nor bottom-up search, but rather is dichotomic. It is also "anytime," so that fixed termination of the search is not necessary. We have applied our algorithm to yeast sequence data to discover rules predicting function classes from secondary structure. These resultant rules are informative, consistent with known biology, and a contribution to scientific knowledge. Surprisingly, the rules also demonstrate that secondary structure prediction algorithms are effective for membrane proteins and suggest that the association between secondary structure and function is stronger in membrane proteins than globular ones. We demonstrate that our algorithm can successfully predict gene function directly from predicted secondary structure; e.g., we correctly predict the gene YGL124c to be involved in the functional class "cytoplasmic and nuclear degradation." Datasets and detailed results (generated motifs, rules, evaluation on test dataset, and predictions on unknown dataset) are available at www.aber.ac.uk/compsci/Research/bio/dss/yeast.ss.mips/, and www.genepredictions.org.
Affiliation(s)
- Sébastien Ferré
- Irisa/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes cedex, France.
13
Bodén M, Yuan Z, Bailey TL. Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures. BMC Bioinformatics 2006; 7:68. [PMID: 16478545 PMCID: PMC1386714 DOI: 10.1186/1471-2105-7-68] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 06/22/2005] [Accepted: 02/14/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. RESULTS Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. CONCLUSION Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
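A continuum secondary structure, as described above, assigns each residue a probability for every conformational state rather than a single label. One plausible way to obtain such targets from an NMR ensemble, sketched here as an assumption and not as the authors' procedure, is to take the fraction of models in which a residue is assigned each state. The function names and the four-model toy ensemble are hypothetical.

```python
from collections import Counter


def continuum_ss(model_assignments, states="HEC"):
    """Per-residue state probabilities from an ensemble.

    model_assignments: list of equal-length secondary-structure strings,
    one per NMR model. Returns one dict per residue mapping each state
    to the fraction of models showing it.
    """
    if not model_assignments:
        raise ValueError("need at least one model")
    if len({len(m) for m in model_assignments}) != 1:
        raise ValueError("all models must cover the same residues")
    n_models = len(model_assignments)
    profile = []
    for column in zip(*model_assignments):
        counts = Counter(column)
        profile.append({s: counts[s] / n_models for s in states})
    return profile


def ambivalent_positions(profile):
    """Indices of residues whose most likely state is not unanimous."""
    return [i for i, probs in enumerate(profile) if max(probs.values()) < 1.0]


# Invented four-model ensemble over four residues.
models = ["HHHC", "HHCC", "HECC", "HHCC"]
profile = continuum_ss(models)
print(profile[1])                     # {'H': 0.75, 'E': 0.25, 'C': 0.0}
print(ambivalent_positions(profile))  # [1, 2]
```

Taking the most probable state at each residue recovers a categorical prediction, which is the sense in which a continuum model can still be compared with standard three-state methods.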
Affiliation(s)
- Mikael Bodén
- School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St Lucia, Australia
- Zheng Yuan
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
- Timothy L Bailey
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
14
Graña O, Eyrich VA, Pazos F, Rost B, Valencia A. EVAcon: a protein contact prediction evaluation service. Nucleic Acids Res 2005; 33:W347-51. [PMID: 15980486 PMCID: PMC1160172 DOI: 10.1093/nar/gki411] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/27/2022]
Abstract
Here we introduce EVAcon, an automated web service that evaluates the performance of contact prediction servers. Currently, EVAcon is monitoring nine servers, four of which are specialized in contact prediction and five of which are general structure prediction servers. Results are compared for all newly determined experimental structures deposited into PDB (∼5–50 per week). EVAcon allows for a precise comparison of the results based on a system of common protein subsets and the commonly accepted evaluation criteria that are also used in the corresponding category of the CASP assessment. EVAcon is a new service added to the functionality of the EVA system for the continuous evaluation of protein structure prediction servers. The new service is accessible from any of the three EVA mirrors: PDG (CNB-CSIC, Madrid); CUBIC (Columbia University, NYC); and Sali Lab (UCSF, San Francisco).
Affiliation(s)
- Volker A. Eyrich
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
- Alfonso Valencia
- To whom correspondence should be addressed. Tel: +34 91 585 4570; Fax: +34 91 585 4506
15
Przybylski D, Rost B. Improving Fold Recognition Without Folds. J Mol Biol 2004; 341:255-69. [PMID: 15312777 DOI: 10.1016/j.jmb.2004.05.041] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Received: 02/03/2004] [Revised: 05/18/2004] [Accepted: 05/18/2004] [Indexed: 11/21/2022]
Abstract
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
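The link between an extreme value distribution and statistical significance can be made explicit. If alignment scores from unrelated pairs follow a Gumbel (type I extreme value) distribution with location μ and scale β, the chance of seeing a score of at least x is P(S ≥ x) = 1 − exp(−exp(−(x − μ)/β)). The sketch below evaluates that formula; the parameter values are invented for illustration and are not fitted values from the paper, and `gumbel_p_value` is a hypothetical name.

```python
import math


def gumbel_p_value(score, mu, beta):
    """P(S >= score) under a Gumbel distribution with location mu and
    scale beta. Smaller values mean a more significant alignment score."""
    if beta <= 0:
        raise ValueError("scale beta must be positive")
    z = (score - mu) / beta
    if z < -700:  # exp(-z) would overflow; the probability is 1 to machine precision
        return 1.0
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(-z))


# Hypothetical background parameters for a scoring scheme.
mu, beta = 20.0, 5.0
for score in (20.0, 30.0, 50.0):
    print(score, gumbel_p_value(score, mu, beta))
```

In practice μ and β would be estimated from scores of alignments between unrelated proteins, and a p-value is often converted to an expected number of false hits by multiplying by the database size.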
Affiliation(s)
- Dariusz Przybylski
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
16
Lett D, Hsing M, Pio F. Interaction profile-based protein classification of death domain. BMC Bioinformatics 2004; 5:75. [PMID: 15189571 PMCID: PMC459208 DOI: 10.1186/1471-2105-5-75] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/07/2004] [Accepted: 06/09/2004] [Indexed: 11/10/2022]
Abstract
BACKGROUND The increasing number of protein sequences and 3D structures obtained from genomic initiatives is leading many of us to focus on proteomics, and to dedicate our experimental and computational efforts to the creation and analysis of information derived from 3D structure. In particular, the high-throughput generation of protein-protein interaction data from a few organisms makes such an approach very important towards understanding the molecular recognition that makes up the entire protein-protein interaction network. Since the generation of sequences and experimental protein-protein interactions increases faster than the 3D structure determination of protein complexes, there is tremendous interest in developing in silico methods that generate such structures for prediction and classification purposes. In this study we focused on classifying protein family members based on their protein-protein interaction distinctiveness. Structure-based classification of protein-protein interfaces has been described initially by Ponstingl et al. [1] and more recently by Valdar et al. [2] and Mintseris et al. [3], from complex structures that have been solved experimentally. However, little has been done on protein classification based on the prediction of protein-protein complexes obtained from homology modeling and docking simulation. RESULTS We have developed an in silico classification system entitled HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPS) are used to summarize protein-protein interaction characteristics. This system, applied to a dataset of 64 proteins of the death domain superfamily, was used to classify each member into its proper subfamily. Two classification methods were attempted, heuristic and support vector machine learning. Both methods were tested with a 5-fold cross-validation.
The heuristic approach yielded a 61% average accuracy, while the machine learning approach yielded an 89% average accuracy. CONCLUSION We have confirmed the reliability and potential value of classifying proteins via their predicted interactions. Our results are in the same range of accuracy as other studies that classify protein-protein interactions from 3D complex structures obtained experimentally. While our classification scheme does not directly take sequence information into account, our results are in agreement with functional and sequence-based classification of death domain family members.
Affiliation(s)
- Drew Lett
- Department of Computer Science, Simon Fraser University, 8888 University Drive, Burnaby, B.C. Canada, V5A 1S6
- Michael Hsing
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, B.C. Canada, V5A 1S6
- Frederic Pio
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, B.C. Canada, V5A 1S6